Showing entries 1 to 1
Displaying posts with tag: probabilistic fault detection (reset)
Services Monitoring with Probabilistic Fault Detection

In this blog post, we’ll discuss services monitoring using probabilistic fault detection.

Let’s admit it, the task of monitoring services is one of the most difficult. It is time-consuming, error-prone and difficult to automate. The usual monitoring approach has been pretty straightforward in the last few years: setup a service like Nagios, or pay money to get a cloud-based monitoring tool. Then choose the metrics you are interested in and set the thresholds. This is a manual process that works when you have a small number of services and servers, and you know exactly how they behave and what you should monitor. These days, we have hundred of servers with thousands of services sending us millions of metrics. That is the first problem: the manual approach to configuration doesn’t work.

That is not the only problem. We know that no two servers perform the same because no two servers have exactly the …

[Read more]
Showing entries 1 to 1