Showing entries 1 to 5
Displaying posts with tag: robustness
How to use "cron" to run periodic scheduled jobs

"cron" is the Unix (Linux, etc.) scheduler which runs regularly scheduled jobs. This post is not meant to be the "man" page (it has one of those already), but a set of ideas on how to use "cron" in a robust way.

Setting up cron jobs
There are at least *three* ways of configuring cron jobs on a modern Linux system; technically these are extensions, but they're so quasi-standard that they're even (possibly) available on FreeBSD :)

  • Per-user "crontab" file. This can be edited using crontab -e, or replaced by crontab <file>. If you are installing system-level software, you probably don't want to use this. Each user can have only one crontab file.
  • System-wide "crontab" file, usually /etc/crontab. This is usually managed by the distribution / package manager, and you probably don't want to change this; there is only one.
  • Per-package "crontab" files - usually kept in /etc/cron.d. There are multiple files, usually one per …
[Read more]
How to correctly make "latest" symlinks

"Latest symlink"
A "latest" symlink is a symbolic link (on Linux, Unix etc.) which links to the "latest" version of a file.
Suppose we have a file which takes some effort to create, which is generated periodically or in response to some stimulus (e.g. user activity). Then we want to create a "latest version" symlink.
Ideally the properties should be:

  • latest symlink always points at the latest version (duuh!)
  • latest symlink always exists
  • latest symlink never points at a partially completed, broken, missing or otherwise bad file
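One way to get all three properties is to create the new symlink under a temporary name and then rename it over the old one, since rename() replaces the destination atomically on POSIX systems. Here is a minimal sketch in Python; the function name update_latest and the temporary naming scheme are my own illustration, and it assumes the target file has already been completely written before this is called.

```python
import os


def update_latest(target: str, link_path: str) -> None:
    """Atomically point the symlink at link_path to an already-finished file."""
    target = os.path.abspath(target)
    # Refuse to publish a missing file, so "latest" never dangles.
    if not os.path.isfile(target):
        raise FileNotFoundError(target)

    link_dir = os.path.dirname(os.path.abspath(link_path))
    # The temporary link lives in the same directory as the final link,
    # so the rename never crosses a filesystem boundary.
    tmp_link = os.path.join(
        link_dir, ".%s.tmp.%d" % (os.path.basename(link_path), os.getpid())
    )

    # Clear out a leftover temporary link from an earlier crashed run.
    try:
        os.unlink(tmp_link)
    except FileNotFoundError:
        pass

    os.symlink(target, tmp_link)
    # os.replace() maps to rename(2): readers see either the old link
    # or the new one, never a moment where the link is missing.
    os.replace(tmp_link, link_path)
```

Deleting the old link and then creating a new one would leave a short window where "latest" does not exist; the rename avoids that window.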

Sometimes people do this in a way which won't work.
How to create a symlink
Dead easy, right? Just call the "symlink" function.
 int symlink(const char *oldpath, const char *newpath);

 DESCRIPTION
       symlink()  creates  a  symbolic  …

[Read more]
The most common cause of unavailability

Hi, Happy new year.

I've done a lot of work on high-availability systems. There is a lot of writing on high-availability systems - how to implement failover, hot-spare systems, load-balancers etc.

However, most of these seem to make an assumption: humans are infallible.

In practice, this is not always the case.

In fact, I'd say that probably about 75% of downtime is caused by human errors, cock-ups, mistakes. I'm not an expert, but I suspect that it's about the same proportion as air crashes caused by pilot (or someone else's) error.

So, it's the human, stupid. PBKAC (problem between keyboard and chair).

Here are some possible fixes:

Give humans less work to do
We can avoid SOME human errors by having systems automatically configure themselves, set themselves up, or perform sanity checks before accepting settings.
"Blindly accepting" instructions …

[Read more]
When commit appears to fail

So you're using explicit transactions. Everything appears to work (every query gives the expected result) until you get to COMMIT.

Then you get an exception thrown from COMMIT. What happened?

Usually this would be because the server has been shut down, or you've lost the connection.

The problem is that you can't assume that the commit failed, but you also can't assume that it succeeded.

A robust application must make NO ASSUMPTION about whether a failed commit did, indeed, commit the transaction or not. It can safely assume that either all or none of it was committed, but can't easily tell which.

So the only way to really know is to have your application somehow remember that the transaction MIGHT have failed, and check later.

Possible solutions:

  • Ignore it and deal with any inconsistencies manually, or decide that you don't care :)
[Read more]
MySQL running out of disc space

Running out of disc space is not a good situation. However, if it does happen, it would be nice to have some control over what happens.

We use MyISAM. When you run out of disc space, MyISAM just sits there and waits. And waits, and waits, apparently forever, for some space to become available.

This is not good, because an auditing/logging application (which ours is) may have lots of available servers which it could send its data to - getting an error from one would simply mean that the data could be audited elsewhere.

But if the server just hangs, and waits, the application isn't (currently) smart enough to give up and try another server, so it hangs the audit process too. Which means that audit data starts to back up, and customers wonder why they can't see recent data in their reports etc.

There has to be a better way. I propose:

  • A background thread monitors the disc …
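The rest of the proposal is cut off here, so the following is only a generic sketch of the first idea, written as an application-side Python thread rather than anything inside MySQL itself. The names is_low and DiskMonitor, the threshold and the polling interval are all assumptions of mine. The thread polls free space with shutil.disk_usage and raises a flag that other code could check in order to fail fast instead of waiting forever.

```python
import shutil
import threading


def is_low(path: str, min_free_bytes: int) -> bool:
    """True if the filesystem holding `path` has less free space than wanted."""
    return shutil.disk_usage(path).free < min_free_bytes


class DiskMonitor(threading.Thread):
    """Background thread that keeps a 'disc is nearly full' flag up to date."""

    def __init__(self, path: str, min_free_bytes: int, interval: float = 5.0):
        super().__init__(daemon=True)
        self.path = path
        self.min_free_bytes = min_free_bytes
        self.interval = interval
        self.low = threading.Event()  # set while free space is below the limit
        self._stop_requested = threading.Event()

    def run(self) -> None:
        while not self._stop_requested.is_set():
            if is_low(self.path, self.min_free_bytes):
                self.low.set()
            else:
                self.low.clear()
            # Sleep for the interval, but wake up at once if stop() is called.
            self._stop_requested.wait(self.interval)

    def stop(self) -> None:
        self._stop_requested.set()
```

A writer could check monitor.low.is_set() before accepting new data and return an error straight away, which would give the client the chance to send its data to another server.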
[Read more]