Cut through the noise and find the signal that matters
Last week, I discussed discovery in the virtual environment, and how important it is for virtual admins to know and understand the discovery process. This week, I’ll go over the finer points of alerting in the virtual environment. To truly appreciate the importance of alerting, you have to understand the pain that comes with incorrect alerting. Alerting starts with monitoring, which is an important aspect of discovery. The data from the systems, VMs, and applications being monitored can provide valuable insights into the data center ecosystem, but that data can easily overwhelm the virtualization admin. A constant stream of false alarms and data noise can leave you paralyzed, endlessly second-guessing thresholds instead of fixing problems. Suffice it to say, when it comes to alerting, more isn’t always better.
Alerting: Find out when something breaks
The essence of skillful alerting is ensuring that you’re not constantly in front of a monitor, because, frankly, no one has time for that. The noise should be filtered from the signal in such a way that only the most important information is presented to you. The information that’s highlighted should allow you to take more efficient corrective actions on a much narrower problem set. As you gain experience, you’ll become more adept at creating meaningful alerts to bypass even more noise. This will save you time, help you avoid wild goose chases, aka false positives, and generally make your life easier.
In addition to cutting through the metrics noise and data deluge, alerting serves two other critical functions:
- Records that a particular event has occurred, or a threshold has been reached or exceeded.
- Triggers a notification to a virtualization admin for that given event.
Alerting provides the first clues that an event is about to happen, is happening, or has happened. It guides the first steps on the path toward troubleshooting and remediating an event.
All quiet on the alerting front
Skillful alerting is grounded in defining and refining the key metrics, important events, and thresholds for the VMs, resources, and applications that are most critical to your organization’s success. A virtualization management tool that supports you through every stage of the alert lifecycle, while cutting through the extraneous noise, is critical to successful alerting.
A proper tool will save you time in the alert lifecycle as you troubleshoot and remediate issues. It does this by keeping the application stack in context while surfacing key events and trends in metrics prior to the incident that caused your application to slow or go down.
The alert lifecycle spans three primary stages: alert creation, alert handling and routing, and alert feedback.
- Alert creation means deciding on key health and performance indicators and setting thresholds for those indicators. The data and analysis you gathered while practicing the discovery skill can seed the initial alerts. Common virtualization alerts that you should set up include CPU and memory utilization on the hosts and VMs, storage and network latency, I/O operations per second (IOPS), and application-specific alerts.
- Alert handling and routing means creating a meaningful notification in response to the alert trigger, and sending that notification to the person who can take the proper action to prevent or resolve the issue. These notifications can include emails, SMS messages, or automated calls to cellphones.
- Alert feedback means updating alerts and their trigger conditions as the VMs, applications, or stack change, so that real notifications stay well ahead of false alarms in the dynamic virtual environment. Your virtual data center ecosystem changes over time as you add and remove applications, resources, and VMs, so you need to be able to alter your alerts and their thresholds as needed to do your job well.
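To make the three stages concrete, here is a minimal Python sketch of the lifecycle. It is an illustration only, not how any particular management tool works: the `Alert` class, the metric name `cpu_percent`, the recipient address, and the tuning rule (raise the threshold by a fixed step when more than half of the reviewed alerts were false alarms) are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Alert:
    """Hypothetical alert rule: one metric, one threshold, one recipient."""
    name: str
    metric: str        # illustrative metric name, e.g. "cpu_percent"
    threshold: float   # fire when the sampled value is >= this number
    recipient: str     # who should be notified
    history: List[bool] = field(default_factory=list)  # True = real issue

    def evaluate(self, sample: Dict[str, float]) -> bool:
        """Creation: compare the sampled metric against the threshold."""
        value = sample.get(self.metric)
        return value is not None and value >= self.threshold

    def record_feedback(self, was_real: bool) -> None:
        """Feedback: note whether a fired alert pointed at a real issue."""
        self.history.append(was_real)

    def false_alarm_rate(self) -> float:
        """Share of reviewed alerts that turned out to be false alarms."""
        if not self.history:
            return 0.0
        return self.history.count(False) / len(self.history)


def route(alert: Alert, sample: Dict[str, float],
          send: Callable[[str, str], None]) -> bool:
    """Handling and routing: build a message and hand it to `send`.

    `send` stands in for email, SMS, or a paging call (an assumption).
    Returns True if a notification went out.
    """
    if not alert.evaluate(sample):
        return False
    message = (f"[{alert.name}] {alert.metric}={sample[alert.metric]} "
               f"(threshold {alert.threshold})")
    send(alert.recipient, message)
    return True


def tune(alert: Alert, max_false_rate: float = 0.5, step: float = 5.0) -> None:
    """Feedback: loosen a noisy alert, then start a fresh review window."""
    if alert.false_alarm_rate() > max_false_rate:
        alert.threshold += step
        alert.history.clear()
```

In use, you would call `route` on each new sample, call `record_feedback` after someone reviews a fired alert, and run `tune` periodically. A real environment would likely need more nuance than a fixed step, such as different rules per metric or a review by the admin before any threshold changes.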
Alerting giveth and virtualization admins need not taketh all
Alerting sets the boundary on how many incidents you end up watching. You can alert on every performance counter and event that your virtualization platform supports, but does that really help you fix application performance issues in your VMs, or VMs that stop working? As a virtualization admin, you have to know which counters and events are both relevant and important to the issue at hand.
This is where leveraging a good virtualization management tool comes in. It should have ready-made alerts that cover the most common virtualization issues and events. If you are new to virtualization, you can use these out-of-the-box alerts as starting points on your learning journey. These pre-made alerts should be grounded in well-established virtualization principles. Mastering alerting involves consistently identifying the right alerts in terms of number and types, and knowing how to tailor those alerts to meet your specific IT operational objectives.
A major challenge of alerting is striking the right balance between getting visibility into your virtual environment and inundating stakeholders with too many alerts. If too many false alarms go out, alerting turns into the alert that cried virtual wolf. To avoid that situation, virtualization admins should create alerts with connected context. For instance, you might combine multiple thresholds across the CPU, memory, networking, and storage subsystems of a virtual cluster with application events from a VM into one alert that focuses on that specific application.
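One way to picture an alert with connected context is the Python sketch below. It is a rough illustration under stated assumptions: the metric names, the threshold values, the rule that at least two subsystems must be over threshold, and the `composite_alert` function itself are all hypothetical and not taken from any specific product.

```python
from typing import Dict, List, Optional

# Hypothetical thresholds for the cluster that hosts one application.
THRESHOLDS: Dict[str, float] = {
    "cpu_percent": 85.0,
    "memory_percent": 90.0,
    "network_latency_ms": 50.0,
    "storage_latency_ms": 20.0,
}


def breached_subsystems(metrics: Dict[str, float],
                        thresholds: Dict[str, float] = THRESHOLDS) -> List[str]:
    """Return the metric names at or above their threshold, sorted by name."""
    return sorted(
        name for name, limit in thresholds.items()
        if metrics.get(name, 0.0) >= limit
    )


def composite_alert(app: str,
                    metrics: Dict[str, float],
                    app_events: List[str],
                    min_breaches: int = 2) -> Optional[str]:
    """Fire one alert only when stack symptoms and app events line up.

    Requires at least `min_breaches` subsystems over threshold AND at
    least one application event; otherwise returns None (stay quiet).
    """
    breached = breached_subsystems(metrics)
    if len(breached) < min_breaches or not app_events:
        return None
    return (f"{app}: {', '.join(breached)} over threshold; "
            f"app events: {', '.join(app_events)}")
```

The design choice here is that a single busy subsystem, or an application event with a healthy stack underneath it, stays silent. Only the combination produces a notification, and that notification already names the subsystems and events involved, which narrows the problem set before troubleshooting starts.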
Alerting is a critical step on the path to becoming a master of virtualization. It cuts through the data noise and notifies the virtualization admin of an incident. Ideally, the notification includes the relevant details about that specific incident in the virtual data center.
Next week, I’ll discuss the remediation skill, one of four core skills that virtualization admins need to master.
You can also download my latest eBook, which walks you through each of the four essential skills that any virtualization admin needs to master.