Driving IT efficiency is a topic that always makes it to the list of top issues for IT organizations and MSPs alike when budgeting or discussing strategy. In a recent post we talked about automation as a way to help reduce the number of trouble tickets and, in turn, to improve the effective use of your most expensive asset – your professional services staff. This post looks at the other side of the trouble ticket coin – how to minimize the mean time to resolution of problems when they do occur and trouble tickets are submitted.
The key is to reduce the variability in time spent resolving issues. Easier said than done? Mean Time to Repair/Resolve (MTTR) can be broken down into four main activities:
- Awareness: identifying that there is an issue or problem
- Root-cause: understanding the cause of the problem
- Remediation: fixing the problem
- Testing: verifying that the problem has been resolved
Of these four components, awareness, remediation, and testing tend to be the smaller activities and also the less variable ones.
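To see which activity drives variability, the time spent in each one can be recorded per ticket and compared. The following is a minimal sketch with made-up ticket data; the field names and the durations are assumptions for illustration, not figures from any real service desk.

```python
from statistics import mean, pstdev

# Hypothetical per-ticket phase durations in minutes (illustrative data only).
tickets = [
    {"awareness": 5, "root_cause": 30, "remediation": 15, "testing": 10},
    {"awareness": 7, "root_cause": 240, "remediation": 20, "testing": 10},
    {"awareness": 6, "root_cause": 90, "remediation": 10, "testing": 12},
]

PHASES = ("awareness", "root_cause", "remediation", "testing")


def phase_stats(tickets):
    """Return {phase: (mean, population standard deviation)} in minutes."""
    stats = {}
    for phase in PHASES:
        durations = [t[phase] for t in tickets]
        stats[phase] = (mean(durations), pstdev(durations))
    return stats


def mttr(tickets):
    """Mean time to resolution: average of the total duration per ticket."""
    return mean(sum(t[p] for p in PHASES) for t in tickets)


stats = phase_stats(tickets)
# The phase with the largest standard deviation is the main source of variability.
most_variable = max(stats, key=lambda phase: stats[phase][1])
```

With this sample data the most variable phase is `root_cause`, which is the pattern the rest of this post discusses.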
The time taken to become aware of a problem depends primarily on the sophistication of the monitoring system(s). Comprehensive capabilities that monitor all aspects of the IT infrastructure and group infrastructure elements into services tend to be the most productive. Proactive service level monitoring (SLM) enables IT operations to view systems across traditional silos (e.g. network, server, applications) and to analyze the performance trends of the underlying service components. By developing trend analyses in this way, proactive SLM can identify future issues before they occur, for example when application traffic is expanding and bandwidth is becoming constrained, or when server storage is reaching its limit. When unpredicted problems do occur, being able to quickly identify their severity, eliminate downstream alarms, and determine business impact is also an important factor in helping to contain variability and deploy the correct resources for maximum impact.
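As a rough illustration of trend analysis, the storage example can be approximated by fitting a straight line to recent usage samples and projecting when the line crosses capacity. This is only a sketch: it assumes linear growth, and the sample values, the 1000 GB capacity, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_full(days, used_gb, capacity_gb):
    """Estimate days after the last sample until usage reaches capacity.

    Returns None when usage is flat or shrinking (no projected exhaustion).
    """
    slope, intercept = fit_line(days, used_gb)
    if slope <= 0:
        return None
    day_full = (capacity_gb - intercept) / slope
    return max(0.0, day_full - days[-1])


# Hypothetical daily samples: storage used (GB) on a 1000 GB volume.
days = [0, 1, 2, 3, 4]
used_gb = [800, 810, 820, 830, 840]
remaining = days_until_full(days, used_gb, capacity_gb=1000)
```

Here usage grows by 10 GB per day, so the projection gives 16 days of headroom after the last sample. A real monitoring system would typically use longer windows and account for seasonality.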
Identifying the root cause is usually the biggest source of MTTR variability and the one that has the highest cost associated with it. Once again the solution lies both with the tools you use and the processes you put in place. Often management tools are selected by each IT function to help with their specific tasks – the network group will have in-depth network monitoring capabilities, the database group database performance tools, and so on. These tools are generally not well integrated and lack visibility at a service level. Also, correlation using disparate tools is often manpower-intensive, requiring staff from each function to meet and to try to avoid the inevitable “finger-pointing”.
The service level view is important, not only because it provides visibility into business impact, but also because it represents a level of aggregation from which to start the root cause analysis. Many IT organizations start out by using free open source tools but soon realize there is a cost to “free” as their infrastructures grow in size and complexity. Tools that look at individual infrastructure aspects can be integrated but, without underlying correlation logic, they have a hard time relating events to one another and reliably identifying root cause. Poor diagnostics can be as bad as no diagnostics in more complex environments. Investigating unnecessary downstream alarms to make sure they are not separate issues is a significant waste of resources.
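One simple form of the "underlying logic" mentioned above is a dependency map: if an element is alarming and something it depends on is also alarming, the alarm is probably a downstream symptom. The sketch below assumes a hypothetical topology in which each element has at most one upstream dependency and there are no cycles; real environments are usually more complex.

```python
# Hypothetical topology: each element maps to the element it depends on.
DEPENDS_ON = {
    "core-router": None,
    "switch-a": "core-router",
    "db-server": "switch-a",
    "app-server": "switch-a",
    "web-frontend": "app-server",
}


def root_cause_alarms(alarming, depends_on):
    """Keep only alarms that have no alarming upstream dependency."""
    roots = set()
    for element in alarming:
        parent = depends_on.get(element)
        # Walk upstream until we hit another alarming element or the top.
        while parent is not None and parent not in alarming:
            parent = depends_on.get(parent)
        if parent is None:
            roots.add(element)
    return roots


alarming = {"switch-a", "db-server", "app-server", "web-frontend"}
roots = root_cause_alarms(alarming, DEPENDS_ON)
```

In this example four alarms collapse to a single candidate, `switch-a`, so staff investigate one element instead of four.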
Consider a frequently cited cause of MTTR variability – poor application performance. In this case there is nothing specifically “broken”, so it’s hard to diagnose with point tools. A unified dashboard that shows both application process metrics and network or packet level metrics provides a valuable diagnostic view. As a simple example, a response-time monitor could send an alert that the response time of an application is too high. Application performance monitoring data might indicate that a database is responding slowly to queries because the buffers are starved and the number of transactions is abnormally high. Integrating with NetFlow or packet data allows immediate drill-down to isolate which client IP address is the source of the high number of queries. This level of integration speeds the root cause analysis and removes the finger-pointing so that the optimum remedial action can be quickly identified.
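The drill-down step amounts to grouping traffic records by client and ranking the totals for the affected server. The sketch below uses a simplified, hypothetical record format with a per-record query count; it is not the actual NetFlow record layout, which carries packet and byte counters rather than queries.

```python
from collections import Counter

# Hypothetical flow records: (client_ip, server_ip, server_port, query_count)
flows = [
    ("10.0.0.5", "10.0.1.20", 5432, 120),
    ("10.0.0.9", "10.0.1.20", 5432, 4800),
    ("10.0.0.5", "10.0.1.20", 5432, 95),
    ("10.0.0.7", "10.0.1.30", 443, 900),
]


def top_clients(flows, server_ip, server_port, n=3):
    """Rank client IPs by total queries sent to one server and port."""
    totals = Counter()
    for client, server, port, count in flows:
        if server == server_ip and port == server_port:
            totals[client] += count
    return totals.most_common(n)


ranking = top_clients(flows, "10.0.1.20", 5432)
```

For the sample data, `10.0.0.9` stands out with 4800 queries against 215 for the next client, which points the investigation at a single source.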
Once a problem has been identified, the last two pieces of the MTTR equation can be satisfied. The times required for remediation and testing tend to be far less variable and can be shortened by defining clear processes and responsibilities. Automation can also play a key role. For example, a great many issues are caused by misconfiguration. Rolling back configurations to the last good state can be done automatically, quickly eliminating issues even while in-depth root-cause analysis continues. Automation can play a vital role in testing too, by making sure that performance meets requirements and that service levels have returned to normal.
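Automated rollback depends on keeping a history of configuration snapshots and knowing which ones were healthy. A minimal sketch follows, assuming a hypothetical snapshot structure with `version`, `config`, and `healthy` keys and a caller-supplied `apply` function that pushes a configuration to the device.

```python
def last_good_config(history):
    """Return the most recent snapshot marked healthy, or None.

    `history` is ordered oldest to newest; each entry is a dict with
    the hypothetical keys "version", "config", and "healthy".
    """
    for snapshot in reversed(history):
        if snapshot["healthy"]:
            return snapshot
    return None


def roll_back(history, apply):
    """Apply the last good config via the caller-supplied `apply` callable."""
    snapshot = last_good_config(history)
    if snapshot is None:
        raise RuntimeError("no known-good configuration to roll back to")
    apply(snapshot["config"])
    return snapshot["version"]


# Illustrative history: version 3 introduced a bad setting.
history = [
    {"version": 1, "config": {"mtu": 1500}, "healthy": True},
    {"version": 2, "config": {"mtu": 1400}, "healthy": True},
    {"version": 3, "config": {"mtu": 90}, "healthy": False},
]
applied = []
restored_version = roll_back(history, applied.append)
```

The same health check that marks a snapshot as good can be rerun after the rollback, which covers the testing step as well.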
To maximize IT efficiency and effectiveness and help minimize mean time to resolution, IT management systems can no longer work in vertical or horizontal isolation. The interdependence between services, applications, servers, cloud services and network infrastructure mandates the adoption of comprehensive service-level management capabilities for companies with more complex IT service infrastructures. The amount of data generated by these various components is huge, and the rate of generation is so fast that traditional point tools cannot integrate it or keep up with any kind of real-time correlation.
Learn more about how Kaseya technology can help you manage your increasingly complex IT services. Read our whitepaper, Managing the Complexity of Today’s Hybrid IT Environments.
What tools are you using to manage your IT services?