Unexpected downtime is one of your worst nightmares, but most
attempts to find problems before they happen are threshold-based.
Thresholds create noise, and alerts create false positives so
often you may miss actual problems.
When we began building VividCortex, we introduced Adaptive Fault
Detection, a feature to detect problems through a combination of
statistical anomaly detection and queueing theory. It’s our
patent-pending technique to detect system stalls
in the database and disk. These are early indicators of serious
problems, so it’s really helpful to find them. (Note: “fault” is
kind of an ambiguous term for some people. In the context we’re
using here, it means a stall/pause/freeze/lockup).
The initial version of fault detection enabled us to find hidden
problems nobody suspected, but as our customer base diversified,
we found more situations that could fool it. We’ve released a new
version that …
[Read more]