We just had a booboo in one of our internal systems, causing it to not come up properly on reboot. The actual mishap occurred several weeks ago (simple case of human error) and was in itself a valid change so monitoring didn’t raise any concerns. So, as always, it’s interesting and useful to think about such events and see what we can learn.
Years ago, but for some now still, one objective is to see long uptime for a server, sometimes years. It means the sysadmin is doing everything right, and thus some serious pride is attached to this number. As described only last week in …[Read more...]