Elements of a monitoring system
A good monitoring system is hard to find. There are plenty of tools, scripts, and applications that solve a narrow use case. For example, someone wants a way to poll a service to see when it is unavailable, so they write a script to do that. There are hardly any systems that provide a good end-to-end solution for monitoring. This is the primary reason we chose to develop our own monitoring system inside Yahoo. It was the only way we could provide a solution that was flexible enough to fit the diverse usage within Yahoo while being designed around the way Yahoo service engineers operate. Scalability is a big factor for Yahoo, and none of the existing solutions address scalability to the extent that Yahoo requires.
What are the various elements of a complete solution for monitoring?
They are (in no particular order):
* Data Collection
* Status Tracking
* Alert Generation
* Storage
* Configuration Management
* User Interface
You can argue the list is too short (or too long), but its purpose is to capture the main areas (the main elements). Each area can be broken down into sub-areas. I'll cover each area in a little more depth to provide some clarity. Keep in mind that this topic has so much detail that a whole book could be devoted to it!
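To make the list a bit more concrete, here is a minimal toy sketch in Python of how the six elements might relate to each other in a single process. It is not how the Yahoo system is built. The `Monitor` class, the `checks` mapping, the status names, and the threshold rule are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Monitor:
    """Hypothetical toy monitor; every name here is illustrative only."""

    # Configuration management: check name -> (collector function, threshold)
    checks: Dict[str, Tuple[Callable[[], float], float]]
    # Storage: every collected value, kept per check
    history: Dict[str, List[float]] = field(default_factory=dict)
    # Status tracking: the latest known status per check
    status: Dict[str, str] = field(default_factory=dict)
    # Alert generation: messages produced when a status changes
    alerts: List[str] = field(default_factory=list)

    def run_once(self) -> None:
        for name, (collect, threshold) in self.checks.items():
            value = collect()                                # data collection
            self.history.setdefault(name, []).append(value)  # storage
            new = "CRITICAL" if value > threshold else "OK"
            old = self.status.get(name, "OK")
            self.status[name] = new                          # status tracking
            if new != old:                                   # alert only on a change
                self.alerts.append(f"{name}: {old} -> {new} (value={value})")

    def report(self) -> str:
        # User interface: a plain-text summary of the current statuses
        return "\n".join(f"{n}: {s}" for n, s in sorted(self.status.items()))
```

For example, a check configured as `Monitor(checks={"disk_usage": (read_disk_usage, 0.9)})`, where `read_disk_usage` is a hypothetical function you supply, would record every reading and generate an alert only when the value crosses the 0.9 threshold in either direction. A real system would split these responsibilities across separate components, which is what the individual areas are about.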
