
Service Level Agreements (SLAs) come in many forms in life, each promising a basic level of acceptable experience. Typically, SLAs have some measurable component, i.e., a metric or performance indicator.
Take, for example, the minimum speed limit on interstates. Usually, the sign reads "Minimum Speed 45 mph." I always thought those signs existed to keep drivers who mistook the 70 mph posting for a minimum from running over drivers who mistook 45 mph for a maximum.
It turns out the "minimum speed" concept is enforced in some states in the U.S. to prevent anyone from impeding traffic flow. For those who recall the very old "Beverly Hillbillies" TV show, I've often wondered if Granny sitting in a rocking chair atop a pile of junk in an open-bed truck, driving across the country, might be a good example of "impeding the flow of traffic" at any speed. Although, from the looks of the old truck, it probably couldn't manage the 45 mph minimum either.
In the world of IT, there are all sorts of things that can “impede the flow” of data transfer, data processing, and/or data storage. While there’s nothing as obvious as Granny atop an old truck, there are frequently Key Performance Indicators (KPIs) that could indicate when things aren’t going according to plan.
What Old School SLAs Miss
Historically, IT SLAs have focused on a Reliability, Availability, and Serviceability (RAS) model. While not directly tied to the specific events and obstacles that impede optimum IT performance, RAS has become the norm:
- Reliability – This "thing of interest" shouldn't break, but it will. Let's put a number on it.
- Availability – This "thing of interest" needs to be available for use all the time. That's not really practical. Let's put a number on it.
- Serviceability – When this "thing of interest" breaks or is not available, it must be put back into service instantly. In the real world, that's not going to happen. Let's put a number on it.
In the IT industry, there exist many creative variations on the basic theme described above, but RAS is at the heart of this thing called SLA performance. The problem with this approach from an end-user standpoint is that it misses the intent of the SLA, which is to ensure the productivity/usefulness of "the thing of interest". In the case of a desktop, that means ensuring that the desktop is performing properly to support the needs of the end user. Thus, the end user's productivity/usefulness is optimized if the desktop is reliable, available, and serviceable… but is it really? Consider the following commonplace scenarios:
- The desktop is available 100% of the time, but 50% of the time it doesn't meet the needs of the end user, e.g., it has insufficient memory to run an Excel spreadsheet with its huge, memory-eating macros.
- A critical application keeps crashing, but every call to the service desk results in "Is it doing it now?" After the inevitable "No" is heard, the service desk responds, "Please call back when the application is failing." This kind of behavior frequently results in end users becoming discouraged and simply continuing to use a faulty application by repeatedly restarting it. It also results in a false sense of "reliability" because the user simply quits calling the service desk, resulting in fewer incidents being recorded.
- A system's performance slows to a crawl for no apparent reason at various times of the day. When the caller gets through to the service desk, the system may or may not be behaving poorly. Regardless, the caller can only report, "My system has been running slowly." The service desk may ask, "Is it doing it now?" If the answer is "Yes," they may be able to log into the system and have a look around using basic tools, only to find that none of the system KPIs are being challenged (i.e., CPU, memory, IOPS, and storage are all fine). In this scenario, the underlying problem may have nothing to do with the desktop or application. Let's assume it to be network latency to the user's home drive, and further complicate it by making the high latency prevalent only during peak network traffic periods. Clearly, this falls outside the scope of the traditional RAS approach to SLA management. The result: again, a caller who probably learns to simply tolerate poor performance and do the best they can with a sub-optimized workplace experience.
To Improve End-User Experience, Start Measuring It
So, how does one improve on the traditional RAS approach to SLA management? Why not monitor the metrics known to be strong indicators of a healthy (or not-so-healthy) workstation? In this SLA world, objective, observable system performance metrics are the basis for the measurement of a system's health. For example, if the CPU is insufficient, track that metric and determine to what extent it is impacting the end user's productivity. Then do the same for multiple KPIs. The result is a very meaningful number that indicates how much of a user's time is encumbered by poor system performance.
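As a minimal sketch of this idea, the following Python snippet computes what fraction of periodic KPI samples breach a threshold, i.e., how much of a user's sampled time is "encumbered." The thresholds, sampling interval, and sample values are illustrative assumptions, not real limits from any monitoring product.

```python
# Hypothetical sketch: estimate how much of a user's session time is
# "encumbered" by poor system performance, given periodic KPI samples.
# Thresholds and sample values below are illustrative assumptions.

SAMPLES = [
    # (cpu_pct, mem_pct, disk_latency_ms) captured, say, every 30 seconds
    (35, 60, 4),
    (92, 88, 30),   # CPU under pressure and slow disk
    (40, 95, 6),    # memory exhausted (the Excel-macro scenario)
    (30, 55, 5),
]

THRESHOLDS = {"cpu_pct": 85, "mem_pct": 90, "disk_latency_ms": 25}

def impacted_fraction(samples, thresholds):
    """Fraction of samples in which any KPI breaches its threshold."""
    names = ("cpu_pct", "mem_pct", "disk_latency_ms")
    impacted = sum(
        1 for s in samples
        if any(value > thresholds[name] for name, value in zip(names, s))
    )
    return impacted / len(samples)

print(f"{impacted_fraction(SAMPLES, THRESHOLDS):.0%} of sampled time impacted")
```

With the sample data above, two of the four intervals breach a threshold, so half of the sampled time counts as impacted; a real agent would apply the same logic to many more KPIs and much finer-grained samples.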

In the case of SLAs based on observable system KPIs, once a baseline is established, variations from the baseline are easily observable. Simply counting system outages and breakage doesn't get to the heart of what an IT department wants to achieve. Namely, we all want the end user to have an excellent workspace experience, unencumbered by "impeded flow" of any type. The ultimate outcome of this proposed KPI-based (vs. RAS-based) SLA approach will be more productive end users. In future blogs, I will expand on how various industries are putting a KPI-based SLA governance model into practice.
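The baseline-and-deviation idea can be sketched in a few lines of Python. The latency figures and the three-sigma band below are illustrative assumptions; the point is that a spike like the peak-hour home-drive latency from the earlier scenario stands out against an established baseline even when every RAS-style check passes.

```python
import statistics

# Hypothetical sketch: establish a latency baseline for a user's home-drive
# share, then flag deviations. Values are illustrative, not measured data.

baseline_ms = [12, 14, 11, 13, 15, 12, 14, 13]   # off-peak latency samples
mean = statistics.mean(baseline_ms)
stdev = statistics.stdev(baseline_ms)

def deviates(sample_ms, n_sigma=3):
    """True when a latency sample falls outside the baseline band."""
    return abs(sample_ms - mean) > n_sigma * stdev

print(deviates(13))   # a normal off-peak reading
print(deviates(95))   # a peak-traffic spike the RAS model would never see
```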