Whether your IT operation is a small-scale endeavor or a Web-scale enterprise, systems monitoring plays a critical role in the delivery of your services. The ability to quickly and accurately assess the state and health of your infrastructure and applications is no longer a luxury; it’s a categorical requirement. The question is no longer, “Do you monitor your infrastructure and applications?” Rather, the question is, “How do you monitor your infrastructure and applications?”
I’ve spent quite a few years asking and answering this question, and the many questions that follow. But lately I’ve been focused on a subtle differentiation with regard to tools: how do you distinguish between tools used by your operations staff and tools used by your engineering staff? These two teams have vastly different expectations and requirements for monitoring solutions. Operations teams need ready access to the state and status of a server or service; engineering teams need ready access to the detailed performance of individual components.
One approach to providing both teams with the information they need is to implement multiple views or consoles within a single monitoring solution. Benefits of this approach include reduced cost (when compared with deploying multiple tools), reduced complexity, and, yes, a step closer to that storied “single pane of glass” we keep hearing about. But the downside here is non-trivial: your operations staff will expect the monitoring system to be highly available, and engineering activity on a shared platform can put that availability at risk. That’s where the divide between operations and engineering becomes apparent.
Engineering teams typically avoid running tools in a debugging capacity unless they are actively diagnosing an incident. And for good reason: when you increase the logging level, as an engineer typically does as she seeks to understand the root cause of a problem, the volume of data generated grows dramatically. That data consumes three core resources in your infrastructure: CPU cycles to generate the data, network bandwidth to transmit the data, and storage to, well, store the data. Worse, a sudden increase in logging can degrade the performance of the monitoring tools themselves. For example, if you suddenly crank up your syslog level to debug on a few servers, your monitoring database can grow very quickly, and the rate at which new records are created can impact the performance, even the availability, of the database, which in turn impacts the performance and availability of your monitoring system.
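To put rough numbers on that growth, here is a minimal back-of-the-envelope sketch in Python. Everything in it is an assumption made for illustration: the `LogProfile` class, the `daily_load` function, and the message rates and sizes are hypothetical, not measurements from any particular syslog deployment or monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogProfile:
    """Hypothetical per-server logging profile (illustrative values only)."""
    messages_per_second: float
    avg_message_bytes: int


def daily_load(profile: LogProfile, servers: int) -> tuple[int, int]:
    """Return (records per day, bytes per day) for a group of servers."""
    seconds_per_day = 86_400
    records = int(profile.messages_per_second * seconds_per_day * servers)
    return records, records * profile.avg_message_bytes


# Assumed figures; replace them with rates measured in your own environment.
NORMAL = LogProfile(messages_per_second=2, avg_message_bytes=200)
DEBUG = LogProfile(messages_per_second=150, avg_message_bytes=300)

if __name__ == "__main__":
    servers = 5  # "a few servers" switched to debug logging
    for name, profile in (("normal", NORMAL), ("debug", DEBUG)):
        records, size = daily_load(profile, servers)
        print(f"{name}: {records:,} records/day, {size / 1e9:.2f} GB/day")
```

With these assumed figures, five servers go from well under 1 GB of log data per day to roughly 19 GB per day, and the number of new database records grows 75-fold. Your real ratios will differ, but the exercise is a quick way to gauge whether a shared monitoring database could absorb a debugging session.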
For these reasons, consider deploying a separate tool for your engineering team to use when assessing server and service performance. A dedicated engineering tool will give engineering teams the flexibility to make modifications to the tool without impacting operations. If, in the course of identifying the cause of an outage, engineers must install new components (a new Orion module, for example), your operations team will not be affected by a brief interruption as the new module is configured.
It’s worth noting that change management practices for operations tools and engineering tools should differ. Because your operations tools serve as early-warning indicators for service health, you’ll want to tightly control changes to that platform to avoid unplanned outages. Your engineering tools, however, can enjoy a more relaxed change control policy; brief (or even prolonged, depending on the situation) periods of unavailability may not pose a risk to your infrastructure.
As with any solution, consider the use cases for your monitoring tools before you implement them. After all, monitoring solutions are simply tools. Operations and engineering teams alike will benefit from a well-considered tool deployment strategy.