Monitoring tools offer two core types of functionality: alerts based on aliveness checks and comparing metrics to thresholds, and displaying time-series charts of status counters. Nagios + Graphite are the prototypical time-series tools that do these things.
But these tools don’t answer the crucial questions about what we should monitor. What kinds of aliveness/health checks should we build into Nagios? Which metrics should we monitor with thresholds to raise alarms, and what should the thresholds be? What graphs should we build of status counters, which graphs should we examine and what do they mean?
We need guiding principles to help answer …[Read more...]