Running out of disk space on a MySQL partition? A quick rescue.

No space left on device – this can happen to anyone. Sooner or later you may face a situation where a database has either already run out of disk space or is only minutes away from doing so. What many people do in such cases is start looking for semi-random things to remove – perhaps a backup, a few older log files, or pretty much anything that seems redundant. That means acting under a lot of stress and without much thinking, so it would be great if it could be avoided. Often it can. And what if there is nothing left to remove?

While XFS is usually the recommended filesystem for a MySQL data partition on Linux, the extended filesystem family remains very popular, as it is the default in all major Linux distributions. ext3 and ext4 have a feature that can help resolve a full-disk situation.
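
The post is cut off here, so it does not name the feature; a likely candidate is the reserved block percentage that ext filesystems keep aside for root – that is my assumption, not something stated above. A minimal sketch of reclaiming that reserve with tune2fs (the device and paths are placeholders):

    # Check how many blocks the filesystem reserves for root
    # (assuming /dev/sda3 holds the MySQL datadir)
    tune2fs -l /dev/sda3 | grep -i 'reserved block count'

    # Temporarily drop the reserve from the usual 5% to 1% to free space right away
    tune2fs -m 1 /dev/sda3

    # Confirm the extra space is now visible where MySQL needs it
    df -h /var/lib/mysql

The reserve exists so that root can still operate on a full filesystem, so it is worth putting it back (tune2fs -m 5) once proper cleanup or a disk extension has been done.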

[Read more]
Today’s up-time requirements

When asking about uptime requirements set down in SLAs (Service Level Agreements) with our clients' clients, we hear anything ranging from a few hours to the familiar five nines – and these days sometimes simply 100%, with penalties applying otherwise. From my perspective there is not much difference between five nines and 100%: 99.999% uptime over a year amounts to at most a little over 5 minutes of outage. In many cases, this includes scheduled outages!
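
For reference, the downtime budgets behind those figures are easy to work out; the numbers below are my own arithmetic rather than anything quoted in the post:

    # Downtime budget per year for a few common availability targets
    awk 'BEGIN {
        minutes_per_year = 365.25 * 24 * 60
        n = split("99.9 99.99 99.999", targets, " ")
        for (i = 1; i <= n; i++)
            printf "%s%% uptime -> %.1f minutes of downtime per year\n",
                   targets[i], minutes_per_year * (1 - targets[i] / 100)
    }'

For five nines that comes out to about 5.3 minutes, which is where the "little over 5 minutes" above comes from.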

So, we simply cannot afford any outages, scheduled or otherwise. Emergency support is not going to help here: however fast and good they are, by the time they are done you are already deep into penalty time, or well on your way to not having a business any more. Most will respond within, say, 30 minutes, but then need up to a few hours to resolve the issue. That won't really help you, will it? And in any case, how are you going to do your maintenance? The answer is that you need to architect things differently.

[Read more]
Can I have your horror-stories, please? (SANs and VMs)

Please make it descriptive and graphic, and if anything burnt or exploded I'd love to have pictures.
Include an approximate timeline of when things happened and when it was all working again (if ever).
Thanks!

This somewhat relates to the earlier post "A SAN is a single point-of-failure, too". Somehow people end up with highly virtualised environments where things like replication are set up, yet everything runs on the same hardware and the same SAN backend. So when this admittedly very nice hardware fails (and it will!), the degree of "we're stuffed" is particularly high. The reliance of business processes on that one platform is possibly a bigger factor there than any purely technical issue.

Anyway, if you have good stories of (distributed?) SAN and VM infra failure, please step up and tell all. It'll help prevent similar issues for …

[Read more]