The percentage of servers which are virtualized continues to grow, but management visibility continues to be a challenge. In this blog post we look at the three key monitoring capabilities – full metal, datastore, and performance – to give you the visibility and control you need to keep your virtualized applications performing well.
Before we start, here is a quick description of the two information models that matter for hypervisor management:
Common Information Model
The Common Information Model (CIM) is an open standard that defines how devices, and the elements of devices, in a datacenter are managed and monitored.
VMware Infrastructure API
The VI API is a proprietary API provided by VMware for management and monitoring of components related to the VMware hypervisor.
Full metal monitoring
Fans are essential for proper server function. As rack density goes up, server volume shrinks and fans need to work at higher speeds, which means more wear and tear. A broken fan in a server can quickly cause major heat buildup that affects the server and possibly neighbouring servers. The good news is that it's relatively easy to monitor the state of the fans: the CIM_Fan class exposes a property called HealthState that reports the health of a fan as OK, degraded, or failed.
Power supply health is equally important to monitor. Most enterprise servers can be configured with redundant power supplies, and it's good to keep a spare on hand as well. OMC_PowerSupply is a class that exposes the same HealthState property for each PSU in your server. Just like a fan, each PSU is reported as OK, degraded, or failed.
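As a sketch of what this looks like in practice, the snippet below polls both classes over CIM using the third-party pywbem client (`pip install pywbem`). The host name, port, and credentials are placeholders, and the HealthState value table is taken from the DMTF CIM schema.

```python
# Sketch: poll fan and PSU health on an ESXi host via CIM (pywbem client).
# Host and credentials below are placeholders.

# HealthState values as defined by the DMTF CIM schema.
HEALTH_STATES = {
    0: "Unknown",
    5: "OK",
    10: "Degraded/Warning",
    15: "Minor failure",
    20: "Major failure",
    25: "Critical failure",
    30: "Non-recoverable error",
}

def describe_health(state):
    """Map a numeric CIM HealthState value to a readable label."""
    return HEALTH_STATES.get(state, "Unknown")

def check_hardware_health(host, user, password):
    """Enumerate fan and PSU instances on an ESXi host and print health."""
    import pywbem  # third-party; kept local so the helper above is stdlib-only
    conn = pywbem.WBEMConnection(
        "https://%s:5989" % host, (user, password),
        default_namespace="root/cimv2")
    for cls in ("CIM_Fan", "OMC_PowerSupply"):
        for inst in conn.EnumerateInstances(cls):
            print(cls, inst["ElementName"], describe_health(inst["HealthState"]))
```

Anything other than "OK" for a fan or PSU is worth an alert; a degraded state often gives you time to schedule maintenance before an outright failure.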
The VI API can be used to measure average power usage, which gives an indication of the server's utility cost. More power usage means more heat, which in turn means even more utility cost in the form of heat dissipation. The relevant counter is power.power.average.
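A sketch of pulling that counter through the vSphere API with the third-party pyVmomi library (`pip install pyvmomi`); the vCenter host and credentials are placeholders, and the counter is matched by its dotted name.

```python
# Sketch: read recent power.power.average samples (watts) per ESXi host.
def average_watts(samples):
    """Average a list of power samples in watts."""
    return sum(samples) / float(len(samples)) if samples else 0.0

def report_host_power(vc_host, user, pwd):
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim
    si = SmartConnect(host=vc_host, user=user, pwd=pwd)
    try:
        content = si.RetrieveContent()
        perf = content.perfManager
        # Resolve the numeric id of the power.power.average counter.
        counter_id = next(
            c.key for c in perf.perfCounter
            if "%s.%s.%s" % (c.groupInfo.key, c.nameInfo.key, c.rollupType)
            == "power.power.average")
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.HostSystem], True)
        for host in view.view:
            spec = vim.PerformanceManager.QuerySpec(
                entity=host, maxSample=10,
                metricId=[vim.PerformanceManager.MetricId(
                    counterId=counter_id, instance="")])
            for result in perf.QueryPerf(querySpec=[spec]):
                watts = average_watts(result.value[0].value)
                print("%s: %.0f W average" % (host.name, watts))
    finally:
        Disconnect(si)
```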
RAID controller, storage volumes and battery backup
Three key storage elements you should monitor are the RAID controller, the storage volumes, and the battery. The controller and disks seem obvious, but the battery? In many cases a high-performance RAID controller has a battery to back up the onboard memory in case of a power outage. The memory on the controller is most commonly used for write-back caching, and when the server loses power, the battery ensures that the cache remains intact until you restore power to the server and its contents can be written to disk.
Utilization, IOPS, and latency are metrics that should be monitored and analyzed together. When you have performance problems in a disk subsystem, an "OK" latency can tell you to go and look for problems with IOPS, high utilization can explain why you are not getting the expected IOPS out of the system, and so on.
Utilization can be calculated from the capacity and freeSpace properties of the DatastoreSummary object.
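The arithmetic is simple; as an illustration, the helper below takes both properties in bytes, as the vSphere API reports them.

```python
# Compute datastore utilization from the DatastoreSummary capacity and
# freeSpace properties, which are reported in bytes.
def datastore_utilization(capacity, free_space):
    """Return used-space percentage for a datastore."""
    if capacity <= 0:
        return 0.0
    return 100.0 * (capacity - free_space) / capacity

# Example: a 2 TB datastore with 512 GB free is 75% utilized.
print(datastore_utilization(2048 * 1024**3, 512 * 1024**3))  # 75.0
```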
IO operations per second can be monitored using the VI API datastore.datastoreIops.average counter, which provides an average of read and write I/O operations.
Latency can be measured using the datastore.totalWriteLatency.average and datastore.totalReadLatency.average counters. They show you the average read and write latency for the whole chain, which includes both kernel and device latency.
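Pulling these three datastore counters together, here is a sketch of the counter-resolution step. The dotted names are the ones cited above; the object shape (`groupInfo.key`, `nameInfo.key`, `rollupType`, `key`) assumes pyVmomi's PerfCounterInfo layout.

```python
# The datastore counters discussed above, with their units.
DATASTORE_COUNTERS = (
    "datastore.datastoreIops.average",      # combined read+write IOPS
    "datastore.totalReadLatency.average",   # ms, kernel + device latency
    "datastore.totalWriteLatency.average",  # ms, kernel + device latency
)

def counter_full_name(group, name, rollup):
    """Build the dotted counter name used to match perfCounter entries."""
    return "%s.%s.%s" % (group, name, rollup)

def resolve_counter_ids(perf_counters):
    """Map the dotted names above to numeric counter ids.

    perf_counters is an iterable of objects shaped like pyVmomi's
    PerfCounterInfo (groupInfo.key, nameInfo.key, rollupType, key).
    """
    wanted = set(DATASTORE_COUNTERS)
    return {
        counter_full_name(c.groupInfo.key, c.nameInfo.key, c.rollupType): c.key
        for c in perf_counters
        if counter_full_name(c.groupInfo.key, c.nameInfo.key, c.rollupType)
        in wanted
    }
```

With the ids resolved, the query itself follows the same QuerySpec pattern as the power example, per datastore instance instead of per host.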
Threads scheduled to run on a CPU can be in one of two states: waiting or ready. Both of these states can tell a story about resource shortage. The lesser evil of the two is the wait state, which indicates that the thread is waiting for an IO operation to complete. This can be as simple as waiting for an answer from a host-external resource, or waiting for disk time. The more serious state is the so-called ready state, which indicates that the thread is ready to run, but there is no free CPU to serve it.
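Ready time is easy to quantify: the cpu.ready.summation counter reports milliseconds of accumulated ready time per sample interval (20 seconds for real-time stats), which converts to a percentage as below. The 5% threshold mentioned in the comment is a common rule of thumb, not an official limit.

```python
# Convert cpu.ready.summation (ms of ready time per sample interval) into a
# percentage of the interval; the real-time interval is 20 seconds.
def ready_percent(ready_ms, interval_s=20):
    """Percentage of the interval a vCPU spent ready but not running."""
    return 100.0 * ready_ms / (interval_s * 1000.0)

# 2000 ms of ready time in a 20 s window is 10% -- well past the ~5%
# rule-of-thumb threshold, and worth investigating.
print(ready_percent(2000))  # 10.0
```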
Memory ballooning and IOPS
Memory ballooning is a process that kicks in when a host experiences a low-memory condition and probes the virtual machines for memory to free up. The balloon driver in each VM tries to allocate as much memory as possible within the VM (up to 65% of the VM's available memory), and the host reclaims this memory for the host memory pool.
The memory ballooning counter, mem.vmmemctl.average, shows when this happens. So how can memory ballooning make a dent in your IO graph, you may ask? After the host reclaims memory from VMs, those VMs may fall back on their own virtual memory and start paging memory blocks to disk, which is why memory ballooning may precede higher-than-normal IO activity.
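To put the counter in context, a small sketch: mem.vmmemctl.average is reported in kilobytes, and the configured-memory parameter (in megabytes, as vSphere reports VM memory size) is an assumption about how you would feed it in.

```python
# Express mem.vmmemctl.average (ballooned memory, in KB) as a percentage of
# a VM's configured memory (in MB).
def ballooned_percent(vmmemctl_kb, configured_mb):
    """Share of configured memory currently reclaimed by the balloon driver."""
    if configured_mb <= 0:
        return 0.0
    return 100.0 * (vmmemctl_kb / 1024.0) / configured_mb

# A 4 GB VM with 1 GB ballooned is at 25%.
print(ballooned_percent(1024 * 1024, 4096))  # 25.0
```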
Ballooning may happen even when there is no problem; it's a strategy the host uses to make sure there is free memory available for any VM to consume. Host swapping, however, is always a sign of trouble. There are a number of swap counters that you want to monitor.
These counters show, both cumulatively and as a rate, how much memory is swapped in and out. Host memory swapping is double trouble: not only does it indicate that you have a low host-memory situation, it is also going to hurt IO performance.
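On recent vSphere releases, the usual names for these swap counters are the ones below; the exact set can vary by version, so verify them against your host's counter list.

```python
# Host swap counters (names as seen on recent vSphere releases; verify
# against your version). The plain .average counters report KB moved per
# sample, the Rate counters report KBps.
SWAP_COUNTERS = {
    "mem.swapin.average":      "KB swapped in from disk",
    "mem.swapout.average":     "KB swapped out to disk",
    "mem.swapinRate.average":  "swap-in rate, KBps",
    "mem.swapoutRate.average": "swap-out rate, KBps",
}

def host_is_swapping(samples):
    """True if any swap counter in the samples dict shows activity."""
    return any(v > 0 for v in samples.values())

print(host_is_swapping({"mem.swapinRate.average": 0,
                        "mem.swapoutRate.average": 12}))  # True
```

Because any non-zero value here means the host is already under real memory pressure, alerting on "greater than zero" is reasonable rather than waiting for a threshold.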
Monitoring, reporting, and notification on all these metrics can be a challenge. The good news for Kaseya customers is that you can implement the monitoring described in this article using the new Network Monitor module in VSA 7.0, available now.
- VI API reference
- CIM reference
- CIM monitor – used to monitor hardware and SAN
- Creating a local read-only user for VMware ESXi CIM monitoring
- Datastore monitoring
- VMware performance monitor
Author: Robert Walker