Approaches to Monitoring Systems
Before I dive into the topic, let me give a brief overview. What's a monitoring system? Well, if you're developing or operating a complex online system, you want to know when things break (better yet, you want to know that the system is about to break before it does). A good monitoring system lets you understand the characteristics of your system by showing the different types of data it has collected (e.g. how your CPU is performing over time, whether your server is running out of disk space, or what the average transaction rate of your system is). With these various views of your system, you can see where its bottlenecks and weak points are. A monitoring system also serves as a tool to validate that your changes are addressing the real problem areas.
There are two approaches to monitoring a system:
1) Monitoring from outside-in
2) Monitoring from inside-out
Monitoring from outside-in
This approach is often taken by hosted monitoring services. Examples of these types of services are Gomez and Keynote. There are a large number of small players who offer similar synthetic checks of a URL.
Monitoring from inside-out
Monitoring your servers, systems, devices and services/applications extensively and thoroughly is the key to understanding how the organs of your system are behaving. This is a difficult and time-consuming exercise. It requires a lot of research to pick the right tool(s) to gather the data (if the data exists). If the data does not exist, get ready to spend even more time instrumenting whichever component is missing it. Picking the right tool to gather and visualize the data is no easy task. Why? Well, because monitoring systems process a lot of data, they have to be designed around the most common use cases. For example, consider these two scenarios:
- the monitoring system user wants to capture a small number of metrics at a very high frequency (e.g. 50 metrics that are each generated once every second)
- the monitoring system user wants to capture a large number of metrics at a lower frequency (e.g. 3000 metrics that are each generated once every minute)
In both of these examples, the monitoring system handles 45000 metric values every 15 minutes (50 values × 900 seconds, and 3000 values × 15 minutes), but the way a solution is implemented for each can have a large impact on how efficiently it handles that volume of data.
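To make the arithmetic concrete, here is a minimal Python sketch that computes how many values a workload produces in a 15-minute window and how many arrive together in one collection round. The `Workload` class and its field names are hypothetical, made up for this illustration; they do not come from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of one monitoring workload (illustration only)."""
    metric_count: int      # distinct metrics being collected
    interval_seconds: int  # each metric emits one value per interval

    def values_per_window(self, window_seconds: int) -> int:
        # Collection rounds that fit in the window, times values per round.
        return (window_seconds // self.interval_seconds) * self.metric_count

    def batch_size(self) -> int:
        # Values that arrive together in a single collection round.
        return self.metric_count


WINDOW = 15 * 60  # 15 minutes, in seconds

# 50 metrics sampled every second: many small, frequent batches.
high_frequency = Workload(metric_count=50, interval_seconds=1)
# 3000 metrics need a 60-second interval to reach the same volume:
# few, large batches.
wide = Workload(metric_count=3000, interval_seconds=60)

for name, workload in [("high_frequency", high_frequency), ("wide", wide)]:
    print(name, workload.values_per_window(WINDOW), workload.batch_size())
```

Both workloads come out to 45000 values per window, but one arrives as 900 batches of 50 values and the other as 15 batches of 3000. That difference in shape, rather than the total, is what a design has to account for.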
I will concentrate the rest of the discussion on the design of a monitoring system that is used to monitor from inside-out.