Now it’s time to set up proper monitoring to avoid unpleasant surprises in the future.
Monitoring solves two major problems: alerting and trending. Alerting notifies a responsible person about a major event, such as a service that has stopped working. Trending tracks how something changes over time: disk or memory usage, replication lag, etc.
This post will be about alerting with Nagios.
The major problem with most Nagios setups I’ve seen is an excessive number of false positives. This kills the whole idea of monitoring. The trouble is that when admins get a false alert, they tend to mute it, explicitly or implicitly: they either filter the alerts out or stop taking them seriously. In the general case, the alert must be [Read more...]