Greg points out a section he found interesting in the Google Bigtable paper:
For example, we have seen problems due to all of the following causes: memory and network corruption, large clock skew, hung machines, extended and asymmetric network partitions, bugs in other systems that we are using (Chubby for example), overflow of GFS quotas, and planned and unplanned hardware maintenance.
The partition asymmetry issue should be obvious. If one section of your cluster is too slow, you're going to see throughput fall. Hung machines are another. The distributed filesystem I wrote last year (and will eventually open source once I get time to finish it) handled hung-machine situations.
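How that filesystem dealt with hung machines is not described here, so the following is not its implementation. It is only a minimal sketch of one common approach, a heartbeat timeout, where a node that has not checked in recently is treated as hung. The class name `HeartbeatMonitor`, its methods, and the 5-second default are all hypothetical and chosen for illustration.

```python
import time


class HeartbeatMonitor:
    """Illustrative only: flags a node as hung when it has been silent too long."""

    def __init__(self, timeout=5.0, clock=time.monotonic):
        self.timeout = timeout   # seconds of silence before a node counts as hung
        self.clock = clock       # injectable clock, so the logic is easy to test
        self.last_seen = {}      # node name -> clock reading at its last heartbeat

    def heartbeat(self, node):
        # Record that `node` checked in just now.
        self.last_seen[node] = self.clock()

    def hung_nodes(self):
        # Every known node whose last heartbeat is older than the timeout.
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

A coordinator would call `heartbeat()` whenever a node reports in and poll `hung_nodes()` periodically. A timeout alone cannot tell a hung machine from a slow or partitioned one, so a real system would likely pair it with retries or fencing.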
Here are a few more I can think of.
Memory corruption without ECC …