Showing entries 1 to 6
Displaying posts with tag: outage (reset)
How to create a rock-solid MySQL database backup & recovery strategy

Have you ever wondered what could happen if your MySQL database goes down?

Although it’s evident such a crash will cause downtime – and surely some business impact in terms of revenue – can you do something to reduce this impact?

The simple answer is “yes” by doing regular backups (of course) but are you 100% sure that your current backup strategy will really come through when an outage occurs? And how much precious time will pass (and how much revenue will be lost) before you get your business back online?

I usually think of backups as the step after HA fails. Let’s say we’re in M<>M replication and something occurs that kills the db but the HA can’t save the day. Let’s pretend that the UPS fails and those servers are completely out. You can’t failover; you have to restore data. Backups are a key piece of “Business Continuity.” Also factor in the frequent need to restore data that’s been …

[Read more]
AirBNB didn’t have to fail

Read the original article at AirBNB didn’t have to fail

Today part of Amazon Web Services failed, taking down with it a slew of startups that all run on Amazon’s Cloud infrastructure. AirBNB was one of the biggest, but also Heroku, Reddit, Minecraft, Flipboard & Coursera down with it. Its not the first time. What the heck happened, and why should we care?

1. Root Cause

The AWS service allows companies like AirBNB to build web applications, and host them on servers owned and managed by Amazon. The so-called raw iron of this army of compute power sits in datacenters. Each datacenter is …

[Read more]
ANALYZE TABLE is replicated. RTFM.

Sometimes, I make mistakes. It’s true. That can be difficult for us Systems Engineering-types to say, but I try to distance myself from my ego and embrace the mistakes because I often learn the most from them. ..Blah, blah, school of hard knocks, blah, blah…. Usually my mistakes aren’t big enough to cause any visible impact, but this one took the site out for 10 minutes during a period of peak traffic due to a confluence of events.

Doh!

Here is how it went down…

We have an issue where MySQL table statistics are occasionally getting out of whack, usually after a batch operation. This causes bad explain plans, which in turn cause impossibly slow queries. An ANALYZE TABLE (or even SHOW CREATE INDEX) resolves the issue immediately, but I prefer not get woken up at 4AM by long running query alerts when my family and I are trying to sleep. As a way to work around the issue, we decided to disable InnoDB automatic …

[Read more]
Today’s up-time requirements

When asking about up-time requirements set down in SLAs (Service Level Agreements) with our clients’ clients, we’d hear anything ranging from hours to the familiar five nines, but these days also simply 100% and otherwise penalties apply. From my perspective, there’s not much difference between five nines and 100%, 99.999% uptime over a year amounts to a maximum of little over 5 minutes outage. In many cases, this includes scheduled outages!

So, we can just not have any outages, scheduled or otherwise. Emergency support is not going to help here, because however fast and good they are, you’re already in serious penalty time or well on your way to not having a business any more. Most will respond within say 30 minutes but then need up to a few hours to resolve the issue. That won’t help you, really, will it? And in any case, how are you going to do your maintenance? The answer is, you need to architect things differently.

[Read more]
Blog outage

Sorry for a short outage today – we were moving to a new server we had some problems because of software incompatibilities on the new box. Now all sites on this box should behave as usual


S3 suffers major outage

“Funny how Amazon doesn't use S3 to store any assets for amazon.com”tweet by @gruber

Amazon's S3 suffered a major outage today knocking many websites offline. S3 outage started at approximately 12:00 PM EST and the last time I checked at 11:11PM EST, Smugmug, a popular photo hosting site that extensively uses S3, was still down.

- S3 down for more than 7 hours
- S3 outage, 7 hours and counting
- S3 down again
- Amazon failure downs …

[Read more]
Showing entries 1 to 6