Ramin's blog (ramin at ramin dot net)

Sunday, January 25, 2009

Elements of a monitoring system



A good monitoring system is hard to find. There are plenty of tools, scripts, and applications that solve a narrow use case. For example, someone wants a way to poll a service to see when it is unavailable, so they write a script to do that. Hardly any systems provide a good end-to-end solution for monitoring. This is the primary reason we chose to develop our own monitoring system inside Yahoo. It was the only way we could provide a solution flexible enough to fit the diverse usage within Yahoo while also being designed around the way Yahoo service engineers operated. Scalability is a big factor for Yahoo, and none of the existing solutions addressed scalability to the extent that Yahoo required.

What are the various elements of a complete solution for monitoring?
They are (in no particular order):


* Data Collection

* Status Tracking

* Alert Generation

* Storage

* Configuration Management

* User Interface



You can argue the list is too short (or too long), but its purpose is to capture the main areas (the main elements). Each area can be broken down into sub-areas. I'll cover each area in a little more depth to provide some clarity. Keep in mind that this topic has so much detail that a book could be devoted to it!

Tuesday, January 13, 2009

Status Tracking

The bare-bones component of a monitoring system is status tracking. Status tracking is the act of monitoring the status or health of an entity on the target system. Its purpose is to answer questions like:
"Is Host X reachable?" or "Is my Web Server up?"

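As a minimal sketch of what answering "Is Host X reachable?" can look like, the snippet below treats a host as reachable when a TCP connection to a given port succeeds within a timeout. The function name `is_reachable` and its parameters are illustrative assumptions, not part of any particular monitoring system, and a TCP connect is only one possible definition of "reachable".

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds in time.

    Hypothetical helper for illustration: a real status check might use
    ICMP, an HTTP request, or an application-level health probe instead.
    """
    try:
        # create_connection resolves the host and attempts the connect;
        # the context manager closes the socket again right away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False
```

A status tracker would typically call a check like this on a schedule and record the result (for example "UP" or "DOWN") together with the time of the check.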
Monitoring System Components - Data Collection


This area covers the different ways that the system collects the metrics, status and other information about what it's monitoring. There are 3 basic types of data the monitoring system cares about:
(a) Metrics
(b) Status
(c) Configuration

I cover Data for Configuration in the Configuration Management section.
I cover Data for Status in the Status Tracking section.
Data for metrics can find its way into the monitoring system in two ways: the monitoring system polls (pulls from) the system or component being monitored, or the target system pushes the data asynchronously into the monitoring system. A good monitoring system is flexible enough to offer both options, since each suits different system behaviors and requirements.


In the above diagram, the target system sends metrics to the monitoring system asynchronously (1). The monitoring system did not initiate the request for metrics. This is typically done either by leveraging a scheduler in the target system or by some other trigger (e.g. metrics generated by activity on the target system). When the monitoring system initiates the metrics request (2), the scheduling of that request is performed on the monitoring system, and the request is sent to the target system. After the request is sent, the monitoring system (typically) waits for the response. This ensures that the metrics returned in the payload are associated with that particular request (and, more importantly, with the timestamp of the request). On rare occasions, the monitoring system sends the request and accepts the result asynchronously via method (1). This option is typically difficult to accommodate, since the slight clock difference between the monitoring server and the target system can throw off the logic of associating the metrics with the real timestamp.
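The difference between the two paths can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Collector` and `Sample` names, the `fetch` callable standing in for a request to the target, and the `sent_at` parameter are all hypothetical. The point it shows is where the timestamp comes from: a pulled sample is stamped with the monitoring system's own clock at request time, while a pushed sample carries a timestamp taken from the target's clock.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Sample:
    name: str
    value: float
    timestamp: float  # seconds since the epoch
    source: str       # "pull" or "push"


class Collector:
    """Hypothetical collector that accepts both pulled and pushed metrics."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self.clock = clock
        self.samples: List[Sample] = []

    def poll(self, fetch: Callable[[], Dict[str, float]]) -> None:
        """Pull (2): request metrics, wait, stamp them with the request time."""
        requested_at = self.clock()  # the monitoring system's clock
        metrics = fetch()            # blocks until the target responds
        for name, value in metrics.items():
            self.samples.append(Sample(name, value, requested_at, "pull"))

    def receive_push(self, metrics: Dict[str, float], sent_at: float) -> None:
        """Push (1): the target decided when to send; keep its timestamp."""
        for name, value in metrics.items():
            self.samples.append(Sample(name, value, sent_at, "push"))
```

Because `sent_at` comes from the target's clock, any skew between the two machines shows up directly in the stored timestamps, which is the difficulty described above for mixing a monitoring-side request with an asynchronous reply.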

There are pros and cons to each option.

Sunday, January 11, 2009

Approaches to Monitoring Systems


Before I dive into the topic, let me give a brief overview. What's a monitoring system? Well, if you're developing or operating a complex online system, you want to know when things break (better yet, you want to know that the system is about to break before it does). A good monitoring system lets you understand the characteristics of your system by showing the different types of data it has collected (e.g. how your CPU is performing over time, whether your server is running out of disk space, or what the average transaction rate of your system is). With these various views of your system, you can see where its bottlenecks and weak points are. You can also use the monitoring system as a tool to validate that your changes are addressing the real problem areas.

There are two approaches to monitoring a system:
1) Monitoring from outside-in
2) Monitoring from inside-out

Monitoring from outside-in

This approach is often taken by hosted monitoring services. Examples of these types of services are Gomez and Keynote. There are also many smaller players that offer similar synthetic checks of a URL.

Monitoring from inside-out


Monitoring your servers, systems, devices, and services/applications extensively and thoroughly is the key to understanding how the organs of your system are behaving. This is a difficult and time-consuming exercise. It requires a lot of research to pick the right tool(s) to gather the data (if the data exists). If the data does not exist, get ready to spend even more time instrumenting whichever component is missing it. Picking the right tool to gather and visualize the data is no easy task. Why? Because monitoring systems process a lot of data, each one has to be designed around the use cases it expects most often. For example, consider these two scenarios:
- A monitoring system user wants to capture a small number of metrics at a very high frequency (e.g. 50 metrics generated once every second)
- A monitoring system user wants to capture a large number of metrics at a lower frequency (e.g. 3000 metrics generated once every minute)

In both of these examples, the monitoring system handles 45,000 metric values every 15 minutes, but the approach taken to implement each one can have a large impact on how efficiently the system handles that volume of data.
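The arithmetic behind the 45,000 figure can be checked with a short sketch. It assumes the second scenario delivers its 3000 metrics once per minute, which is the interval at which both totals come to 45,000 over a 15-minute window. The function name `load_profile` and the dictionary keys are made up for illustration.

```python
WINDOW_SECONDS = 15 * 60  # the 15-minute window used in the comparison


def load_profile(metric_count: int, interval_seconds: int,
                 window_seconds: int = WINDOW_SECONDS) -> dict:
    """Describe the shape of the load one scenario puts on the system."""
    batches = window_seconds // interval_seconds
    return {
        "batches": batches,                # how many arrivals in the window
        "values_per_batch": metric_count,  # how many values in each arrival
        "total_values": batches * metric_count,
    }


# 50 metrics every second: many small arrivals.
high_frequency = load_profile(50, 1)
# Assumed interval: 3000 metrics every minute, a few large arrivals.
low_frequency = load_profile(3000, 60)

print(high_frequency)  # 900 batches of 50 values, 45000 in total
print(low_frequency)   # 15 batches of 3000 values, 45000 in total
```

The totals match, but the first profile is 900 small arrivals and the second is 15 large bursts, which is why the same volume can call for different designs.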
I will concentrate the rest of the discussion on the design of a monitoring system that is used to monitor from inside-out.

About Me

Ramin Naimi
I have over 18 years of experience in various high-tech industries. I am currently leading the Web Infrastructure team at TinyPrints, a small company that is revolutionizing the greeting card business. In the recent past, I managed Yahoo’s monitoring infrastructure group (part of the platform engineering group). We developed and operated Yahoo’s internal monitoring and operational metrics collection systems. I have a wide range of experience, from client-side development to distributed servers.