Showing entries 1 to 8
Displaying posts with tag: outage
How Securing MySQL with TCP Wrappers Can Cause an Outage

The Case

Securing MySQL is always a challenge. There are general best practices you can follow to secure your installation, but the more complex your setup, the more likely you are to face issues that are difficult to troubleshoot.

We’ve recently been working on a case (thanks to Alok Pathak and Janos Ruszo for their major contribution to this case) where MySQL became unavailable whenever thread activity was high and exceeded a threshold, though not always the same one.

During that time, the log contained many entries like the following, and mysqld became unresponsive for a few seconds.

2019-11-27T10:26:03.476282Z 7736563 [Note] Got an error writing communication packets
2019-11-27T10:26:03.476305Z 7736564 [Note] Got an error writing …
[Read more]
MySQL, --i-am-a-dummy!

In this blog post, we’ll look at how “operator error” can cause serious problems (like the one we saw last week with AWS), and how to avoid them in MySQL using --i-am-a-dummy.
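As a rough illustration of the kind of mistake this option guards against: the mysql client's --i-am-a-dummy flag (an alias for --safe-updates) rejects UPDATE and DELETE statements that do not use a key in the WHERE clause or a LIMIT clause. The sketch below is only a text-level approximation of that idea, not how MySQL implements it. The function name `is_risky_statement` is hypothetical, and the check knows nothing about keys or indexes.

```python
import re


def is_risky_statement(sql: str) -> bool:
    """Return True for an UPDATE or DELETE that has neither WHERE nor LIMIT.

    Assumption: a plain keyword search is good enough for illustration.
    The real option also requires the WHERE clause to use a key, which
    this sketch cannot verify.
    """
    normalized = sql.strip().lower()
    # Only UPDATE and DELETE statements are of interest here.
    if not re.match(r"(update|delete)\b", normalized):
        return False
    has_where = re.search(r"\bwhere\b", normalized) is not None
    has_limit = re.search(r"\blimit\b", normalized) is not None
    return not (has_where or has_limit)


if __name__ == "__main__":
    for statement in (
        "DELETE FROM orders",
        "DELETE FROM orders WHERE id = 7",
        "UPDATE users SET active = 0 LIMIT 10",
    ):
        verdict = "risky" if is_risky_statement(statement) else "ok"
        print(f"{verdict}: {statement}")
```

A check like this could sit in a review script or deployment tool as an extra guard, though the client option itself is the more reliable safeguard.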

Recently, AWS had some serious downtime in their East region, which they explained as the consequence of a bad deployment. It seems like most of the Internet was affected in one way or another. Some on Twitter dubbed it “S3 Dependency Awareness Day.”

Since the outage, many companies (especially Amazon!) are reviewing their production access and deployment procedures. It would be a lie if I claimed I’ve never made a mistake in production. In fact, I would be afraid of working with someone who claims to have never made a mistake in a production environment.

Making a mistake or two is how you learn to have a full sense …

[Read more]
How to create a rock-solid MySQL database backup & recovery strategy

Have you ever wondered what could happen if your MySQL database goes down?

Although it’s evident such a crash will cause downtime – and surely some business impact in terms of revenue – can you do something to reduce this impact?

The simple answer is “yes”: do regular backups (of course). But are you 100% sure that your current backup strategy will really come through when an outage occurs? And how much precious time will pass (and how much revenue will be lost) before you get your business back online?

I usually think of backups as the step after HA fails. Let’s say we’re in M<>M replication and something occurs that kills the db, but the HA can’t save the day. Let’s pretend that the UPS fails and those servers are completely out. You can’t fail over; you have to restore data. Backups are a key piece of “Business Continuity.” Also factor in the frequent need to restore data that’s been …

[Read more]
AirBNB didn’t have to fail

Today part of Amazon Web Services failed, taking down with it a slew of startups that all run on Amazon’s Cloud infrastructure. AirBNB was one of the biggest, but Heroku, Reddit, Minecraft, Flipboard & Coursera also went down with it. It’s not the first time. What the heck happened, and why should we care?

1. Root Cause

The AWS service allows companies like AirBNB to build web applications, and host them on servers owned and managed by Amazon. The so-called raw iron of this army of compute power sits in datacenters. Each datacenter is …

[Read more]
ANALYZE TABLE is replicated. RTFM.

Sometimes, I make mistakes. It’s true. That can be difficult for us Systems Engineering-types to say, but I try to distance myself from my ego and embrace the mistakes because I often learn the most from them. ..Blah, blah, school of hard knocks, blah, blah…. Usually my mistakes aren’t big enough to cause any visible impact, but this one took the site out for 10 minutes during a period of peak traffic due to a confluence of events.

Doh!

Here is how it went down…

We have an issue where MySQL table statistics are occasionally getting out of whack, usually after a batch operation. This causes bad explain plans, which in turn cause impossibly slow queries. An ANALYZE TABLE (or even SHOW CREATE INDEX) resolves the issue immediately, but I prefer not to get woken up at 4AM by long-running query alerts when my family and I are trying to sleep. As a way to work around the issue, we decided to disable InnoDB automatic …

[Read more]
Today’s up-time requirements

When asking about up-time requirements set down in SLAs (Service Level Agreements) with our clients’ clients, we’d hear anything ranging from hours to the familiar five nines, but these days also simply 100%, with penalties applying otherwise. From my perspective, there’s not much difference between five nines and 100%: 99.999% uptime over a year amounts to a maximum of a little over 5 minutes of outage. In many cases, this includes scheduled outages!
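The five-nines figure is simple arithmetic, sketched here for a few common targets. The function name `downtime_budget_minutes` and the 365-day year are assumptions made for illustration only.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability_percent: float, days: int = 365) -> float:
    """Maximum outage, in minutes, allowed by an availability target.

    Assumption: the SLA period is a plain 365-day year and every minute
    counts equally, scheduled or not.
    """
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * days * MINUTES_PER_DAY


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        minutes = downtime_budget_minutes(target)
        print(f"{target}% -> {minutes:.2f} minutes per year")
```

For 99.999% this works out to roughly 5.26 minutes per year, which is the "little over 5 minutes" mentioned above.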

So, we simply cannot have any outages, scheduled or otherwise. Emergency support is not going to help here, because however fast and good they are, you’re already in serious penalty time or well on your way to not having a business any more. Most will respond within, say, 30 minutes but then need up to a few hours to resolve the issue. That won’t help you, really, will it? And in any case, how are you going to do your maintenance? The answer is, you need to architect things differently.

[Read more]
Blog outage

Sorry for a short outage today. We were moving to a new server and had some problems because of software incompatibilities on the new box. Now all sites on this box should behave as usual.


S3 suffers major outage

“Funny how Amazon doesn't use S3 to store any assets for amazon.com” (tweet by @gruber)

Amazon's S3 suffered a major outage today, knocking many websites offline. The S3 outage started at approximately 12:00 PM EST, and the last time I checked, at 11:11 PM EST, Smugmug, a popular photo hosting site that extensively uses S3, was still down.

- S3 down for more than 7 hours
- S3 outage, 7 hours and counting
- S3 down again
- Amazon failure downs …

[Read more]