Find the unhealthy machines in your server farm.
A large infrastructure company had thousands of machines in their server farm, which were used to process customer jobs.
They had good monitoring and alerting, but it wasn't possible for a human to watch all the servers. Consequently, they set alerts on system metrics for each of these machines.
However, this was not an ideal solution, because subtle behavioral changes in the servers could degrade the customer experience yet not be enough to trigger an alert. One way they tried to mitigate this problem was to make the alerting more sensitive, but that just led to alert fatigue. Furthermore, the alerts required tuning and management over time, so they were a poor long-term solution.
Thus, they worked with Overseer to help them detect unhealthy machines in their server farm more efficiently. Solving this problem would enable them to scale their business as they increase the number of servers without compromising their commitment to high-quality customer experience.
Overseer worked with the domain experts to identify a core group of metrics that would be a good proxy for the health of a given server. This group of metrics was then used to train Overseer's algorithms to model the behavior of healthy servers.
The algorithms took as input a large number of metrics and summarized the server's behavior into a new "meta-metric." Instead of watching each of these metrics, users can now watch this new metric as an alternative for a given server.
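To make the idea of a "meta-metric" concrete, here is a minimal sketch. It is not Overseer's actual algorithm, which the case study does not describe. It assumes a deliberately simple stand-in: learn the mean and spread of each metric from healthy servers, then summarize a new sample as the root-mean-square of its per-metric z-scores. The names fit_baseline and meta_metric, and the sample values, are hypothetical.

```python
import math
import statistics


def fit_baseline(healthy_samples):
    """Learn per-metric mean and spread from samples of healthy servers.

    healthy_samples is a list of rows; each row holds one value per metric.
    This is an illustrative stand-in for "training on healthy behavior".
    """
    columns = list(zip(*healthy_samples))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 when a metric never varies, to avoid dividing by zero.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stdevs


def meta_metric(sample, baseline):
    """Collapse many metrics into one number: the RMS of per-metric z-scores.

    Values near 0 mean the sample looks like the healthy baseline;
    larger values mean the sample deviates from it.
    """
    means, stdevs = baseline
    z_scores = [(x - m) / s for x, m, s in zip(sample, means, stdevs)]
    return math.sqrt(sum(z * z for z in z_scores) / len(z_scores))


# Hypothetical data: two metrics per sample (say, CPU load and latency in ms).
healthy = [[10, 100], [12, 110], [11, 105], [9, 95]]
baseline = fit_baseline(healthy)

typical_score = meta_metric([10.5, 102.5], baseline)
degraded_score = meta_metric([30, 300], baseline)
print(typical_score, degraded_score)
```

A server whose sample resembles the healthy baseline scores close to zero, while one that drifts on several metrics scores noticeably higher, so a single number can be watched in place of the full set of metrics.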
Overseer now had a simple way to reason about a server's behavior at any point in time without watching a large number of metrics. The next step was to extend this approach to identify unhealthy servers. Overseer used its trained model to generate the "meta-metric" for each server in real time, which drastically reduced the amount of information requiring examination. Next, Overseer applied a second layer of analysis across all the "meta-metrics" to find the outliers. These outliers helped identify the unhealthy servers.
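The second layer of analysis can be sketched as follows. The case study does not say which outlier method Overseer used, so this example assumes a common robust rule: the modified z-score based on the median and the median absolute deviation (MAD), with the conventional 0.6745 scaling constant and a cutoff of 3.5. The function name find_outlier_servers, the server names, and the scores are hypothetical.

```python
import statistics


def find_outlier_servers(meta_by_server, threshold=3.5):
    """Return servers whose meta-metric is unusually high for the fleet.

    meta_by_server maps a server name to its current meta-metric value.
    Uses the modified z-score (median and MAD), an assumed technique that
    stays stable even when a few servers are badly behaved.
    """
    values = list(meta_by_server.values())
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        # At least half the fleet shares one value; the score is undefined.
        return []
    # One-sided test: only flag servers that deviate more than their peers.
    return sorted(
        name
        for name, value in meta_by_server.items()
        if 0.6745 * (value - median) / mad > threshold
    )


# Hypothetical fleet snapshot: one meta-metric value per server.
fleet = {
    "srv-a": 1.0,
    "srv-b": 1.1,
    "srv-c": 0.9,
    "srv-d": 1.05,
    "srv-e": 0.95,
    "srv-f": 6.0,
}
print(find_outlier_servers(fleet))
```

Comparing each server against the rest of the fleet, rather than against a fixed threshold, means the cutoff adapts when the whole fleet shifts together, for example under a heavier overall load.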
Overseer was able to catch many unhealthy servers that were impacting customers.
One issue it caught occurred when a customer submitted a large job and the performance of a small percentage of their servers slowly degraded. Because of the way Overseer's algorithms work, it was able to spot the unhealthy servers hours before they would have been caught otherwise.
After observing these results, the customers were convinced that machine learning could add a lot of value to their business. By identifying these unhealthy servers scalably, they would be able to re-route traffic to healthy machines to reduce customer impact. Not only would this allow them to scale the company by taking on more customers (and increasing their server farm size), but they could do so without throwing more bodies at the problem, all while maintaining a pristine customer experience!
Our approach to uncovering unknown-unknowns had a number of benefits including:
1. A simple way to watch a large number of metrics without having to tweak and maintain a lot of thresholds.
2. As a result of (1), Overseer was also able to mitigate the alert fatigue problem.
3. Because Overseer's "meta-metric" was constructed from an original input space of many metrics, the only way it gets flagged as an outlier is if many of the input metrics start to degrade. Thus, Overseer's findings are more accurate, with fewer false positives. This benefit further mitigates alert fatigue.