Please make it descriptive, graphic, and if anything burnt or
exploded I'd love to have pictures.
Include an approximate timeline of when things happened and when
it was all working again (if ever).
Thanks!
This somewhat relates to the earlier post "A SAN is a single
point-of-failure, too". Somehow people end up in scenarios where
highly virtualised environments with SANs have replication and
so on, but it all runs on the same hardware and SAN backend. So
when this admittedly very nice hardware fails (and it will!), the
degree of "we're stuffed" is particularly high. How heavily the
business processes rely on it is possibly a bigger factor there
than any purely technical issue.
Anyway, if you have good stories of (distributed?) SAN and VM
infra failure, please step up and tell all. It'll help prevent
similar issues for …
Mar 13, 2009
Apr 15, 2008
- Suicide
  - having no backups
  - depending on slaves for backup
  - keeping backups on same SAN
  - having a single DBA (Frank didn't like this one at all)
  - not keeping binlogs
- Restoring from backup
  - how much time?
  - uncompressed backup ready to mount?
  - separate network for recovery?
- In Fotolog, 1TB of data was severely hit.
  - first problem: backup was highly compressed (tar.gz)
  - uncompressing took hours
  - so keep uncompressed backups (at least last N days)
  - it should be mountable, rather than transferable
- Frank going over recovery modes at …
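
As a rough illustration of the "keep uncompressed backups (at least last N days)" point above, here is a minimal Python sketch of the retention decision. It is not from the talk: the function name `plan_backups`, the parameter `keep_uncompressed_days`, and the idea of identifying each backup by its date are all assumptions made for this example. It only decides which backups stay uncompressed and which can be compressed; it does not touch any files.

```python
from datetime import date, timedelta


def plan_backups(backup_dates, today, keep_uncompressed_days):
    """Split backup dates into (keep_uncompressed, compress).

    Hypothetical helper: backups taken within the last
    `keep_uncompressed_days` days (counting today) stay uncompressed
    so they can be mounted straight away; older ones are candidates
    for compression.
    """
    cutoff = today - timedelta(days=keep_uncompressed_days)
    keep = sorted(d for d in backup_dates if d > cutoff)
    compress = sorted(d for d in backup_dates if d <= cutoff)
    return keep, compress


if __name__ == "__main__":
    # Example dates are made up for illustration only.
    today = date(2008, 4, 15)
    daily = [today - timedelta(days=n) for n in range(6)]
    keep, compress = plan_backups(daily, today, keep_uncompressed_days=3)
    print("keep uncompressed:", [d.isoformat() for d in keep])
    print("compress:", [d.isoformat() for d in compress])
```

The value of N is a trade-off between disk space and how quickly a recent backup can be put back into service, so it would need to be chosen per site.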