Introduction – To find out what causes unacceptable production system performance, one has to monitor the relevant application and infrastructure components during normal operation as well as when running a load test. A system performance monitoring framework can be compared to the diagnostic equipment doctors use at clinics to understand the nature of the issue a patient is facing. Without access to such diagnostic equipment, a doctor has to rely on what the patient can tell them, combined with whatever visual information can be obtained by inspecting the relevant part of the body.
So What’s The Issue – Today’s systems are designed to expose hundreds of counters for monitoring purposes. The commonly used categories of monitoring metrics are:
- Counters that report utilization of system resources over a specific time interval (for example, percent of total CPU utilization, percent of CPU utilization by a particular process, percent of physical disk utilization, etc.)
- Counters that report throughput, measured as the number of operations executed at a given resource during a particular time interval (for example, network throughput in bytes/second, number of I/O reads/second, etc.)
The common characteristic of both categories is that all their counters are time dependent: the accuracy of the value a counter reports depends on the accuracy of the function measuring time, combined with the duration over which the metric is measured. Unfortunately, in a world dominated by virtualization and cloud computing, most system performance monitoring tools are flawed in the way they measure, manage, and store performance metrics. A detailed discussion can be found in [1, 2]. Here’s a summary of the issue:
- In a virtual environment, the hypervisor treats a guest OS as any other process that can be stopped and resumed at any time.
- While a guest OS is stopped, it cannot accept timer interrupts from the hardware clock. The guest OS therefore misses time intervals and does not account for the time during which it was not running, making time-dependent metrics inaccurate.
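As a small numeric sketch of the second point (the numbers are invented for illustration), the snippet below shows how a utilization figure computed from the guest's own tick count overstates reality when the hypervisor deschedules the VM:

```python
# Hypothetical illustration: how lost timer ticks distort a utilization counter.
# Suppose the guest counted 100 ticks of elapsed time, but the hypervisor
# descheduled it for 25 additional ticks, so 125 ticks of wall-clock time
# actually passed.

def apparent_utilization(busy_ticks, elapsed_ticks):
    """Utilization (%) computed from a given view of elapsed time."""
    return 100.0 * busy_ticks / elapsed_ticks

busy = 80       # ticks the guest spent doing work
observed = 100  # elapsed ticks as counted by the guest OS
real = 125      # wall-clock ticks, including time the VM was descheduled

guest_view = apparent_utilization(busy, observed)
true_view = apparent_utilization(busy, real)
print(f"guest-reported: {guest_view:.1f}%, actual: {true_view:.1f}%")
# guest-reported: 80.0%, actual: 64.0%
```

The guest reports 80% utilization while the machine was in fact busy only 64% of real time, which is exactly the kind of distortion the bullet above describes.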
Taking this fact into consideration, what are the right objects to monitor in virtualized environments?
Queuing Models To The Rescue – To find out, let’s invoke the system’s conceptual queuing model. We described the representation of systems by such models in an earlier post. For now, we just reiterate that a queuing model is an abstract representation of a system that depicts the system resources as well as the demands for resources generated by the users. More on queuing models of systems can be found in the book listed in the references.
Queuing models create a systematic framework for system performance analysis and capacity planning. Queues are the major phenomenon defining system performance, because waiting time in queues adds to the time a transaction spends being processed by system resources. A queue is an indication of an imbalance between the demand generated by a fluctuating user workload and the availability of system resources to satisfy that demand. As such, while troubleshooting a performance bottleneck, it is necessary to find out where in a system queues are building up and exceeding acceptable thresholds. That can be done by monitoring internal system queues. Because instantaneous counts of queue lengths do not depend on the implementation of a system’s timekeeping mechanism, this approach delivers representative performance metrics in any environment.
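The claim that queueing time adds to processing time can be made concrete with the classic single-server (M/M/1) formula from queuing theory, which is standard material rather than anything specific to this post; the numbers below are illustrative:

```python
# Minimal M/M/1 sketch: response time = service time + time spent in the queue.
# R = S / (1 - U), where S is service time and U is server utilization.

def mm1_response_time(service_time, utilization):
    """Average response time of an M/M/1 queue."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)

s = 0.1  # seconds of pure processing per transaction
for u in (0.5, 0.8, 0.95):
    r = mm1_response_time(s, u)
    print(f"utilization {u:.0%}: response {r:.2f}s, of which {r - s:.2f}s is queueing")
```

At 50% utilization the transaction spends as much time queueing as processing; at 95% the queueing delay dominates the response time entirely, which is why locating growing queues is the key troubleshooting step.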
As we pointed out in an earlier post, an Enterprise Application requests two kinds of resources to be allocated to process the user transactions.

Active resources:

- CPU time (data processing)
- I/O time (data transfer)
- Network time (data transfer)

Passive resources:

- Software connections to the servers and services (for example, Web server connections, database connections)
- Software threads
- Software locks
- Storage space
- Memory space
Active resources implement transaction processing and data transfer. Passive resources provide access to active resources: in order to be processed by any active resource, a transaction has to request and be allocated the passive resources. If any asset needed for transaction processing is unavailable because the entire supply is taken by other transactions, the transaction waits in a queue until an asset is released. That wait time increases transaction response time, degrading system performance.
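The passive-resource queue described above can be sketched with a bounded connection pool; this is an illustrative toy (the pool size, timings, and names are invented), not code from the post:

```python
import threading
import time

# Sketch: a passive resource -- a pool of 2 database connections -- guarded
# by a semaphore. When all connections are taken, a transaction blocks in a
# queue until another transaction releases one.

POOL_SIZE = 2
pool = threading.Semaphore(POOL_SIZE)
waits = []  # queueing time observed by each transaction

def transaction(tx_id):
    t0 = time.monotonic()
    with pool:                                 # acquire a connection, or wait
        waits.append(time.monotonic() - t0)    # time spent queued for it
        time.sleep(0.05)                       # work on an active resource

threads = [threading.Thread(target=transaction, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"max wait for a connection: {max(waits):.3f}s")
```

With four concurrent transactions and only two connections, two transactions start immediately while the other two queue for roughly the length of one service period: their wait time is added directly to their response time.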
Examining queues is not an exceptional task – it can be done using the performance monitors and counter-reporting commands built into operating systems. The table below compiles information on Windows counters that deliver instantaneous queue lengths for different system objects. The table is far from all-inclusive, but it is sufficient to demonstrate the queue-based performance monitoring tactic.
The list of queue-reporting counters provided above was extracted from Windows Performance Monitor and should be treated as one possible set of system performance metrics to monitor; its goal is to demonstrate the primary tenets of queue monitoring. Today’s application landscapes are tremendously complex and heterogeneous. Applications with components spread across multiple data centers require monitoring of the relevant queue-reporting counters for multiple system components: counters pertaining to different operating systems, counters pertaining to different physical compute/storage/networking components, as well as application-specific counters exposed by built-in instrumentation. The common denominator across all of them remains the same: if a counter indicates that a queue is building up, it is inevitable that thresholds will be crossed at some point and system performance will degrade.
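The tactic boils down to sampling instantaneous queue lengths and flagging sustained growth. As a hedged sketch: on Windows the samples could come from a counter such as `\System\Processor Queue Length` (e.g. via the `typeperf` command); here a hand-written sample list stands in so the detection logic stays portable, and the threshold is only a commonly quoted rule of thumb, not a universal value:

```python
# Queue-monitoring sketch: flag a bottleneck when an instantaneous queue
# length stays above a threshold for several consecutive samples, rather
# than reacting to a single spike.

def sustained_breach(samples, threshold, min_consecutive):
    """True if queue length exceeds `threshold` for at least
    `min_consecutive` consecutive samples."""
    run = 0
    for q in samples:
        run = run + 1 if q > threshold else 0
        if run >= min_consecutive:
            return True
    return False

# Instantaneous processor queue lengths (illustrative samples).
processor_queue = [1, 0, 2, 5, 6, 7, 6, 1]

# Rule of thumb sometimes used: a sustained run queue above 2 per CPU
# signals a CPU bottleneck.
print(sustained_breach(processor_queue, threshold=2, min_consecutive=3))  # True
```

Because the check compares instantaneous counts against a threshold, it needs no accurate interval timing from the monitored host, which is exactly the property that makes queue monitoring robust in virtualized environments.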
In addition to the timekeeping issue on virtual machines, technological advances such as hyperthreading, power management, and CPU entitlement also distort time-dependent performance counters. That makes queue monitoring the trusted and preferred methodology for the wide range of systems built upon such sophisticated technologies.
- VMware document: “Timekeeping in VMware Virtual Machines” http://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf
- Bernd Harzog. “White Paper: Application Performance Management for Virtualized and Cloud based Environments” http://www.virtualizationpractice.com/blog/wp-content/plugins/downloads-manager/upload/APM_for_Virtualized_and_Cloud_Hosted_Applications.pdf
- Leonid Grinshpan. —Put here a referral to my post on PPA “Application mental model”
- Leonid Grinshpan. Solving Enterprise Applications Performance Puzzles: Queuing Models to the Rescue, Wiley-IEEE Press, 2012, http://tinyurl.com/7hbalv5
- Leonid Grinshpan. Building Systems That Perform : Application Awareness – http://www1.practicalperformanceanalyst.com/2014/07/22/building-systems-that-perform-application-awareness/
- Adrian Cockcroft “Utilization is Virtually Useless as a Metric!” http://www.hpts.ws/papers/2007/Cockcroft_CMG06-utilization.pdf
Dr. Leonid Grinshpan (LinkedIn) is currently Technical Director at Oracle, focusing on enterprise applications capacity planning, load testing, modelling, performance analysis, and tuning. Leonid has a few decades of experience in two complementary areas: computer science and information technology (IT) engineering. He holds a Ph.D. in computer science and is the author of a book on mathematical modelling. Leonid has worked on over 200 capacity planning and performance tuning projects for Fortune 500 customers over the past eleven years. He is also a recipient of the highest award in the USSR for excellence in IT engineering – the Award of the Council of Ministers of the USSR for the design and implementation of an open CAD system for microprocessor-based products. He tested his entrepreneurial skills by co-founding the first Belarusian-American joint venture, Software Security Belarus (acquired in 1997 by the US company Rainbow Technologies).