Article Summary – I spent many years traveling to different companies solving computer (mostly performance) problems. At almost every company, some people were unsure how to begin metering in order to catch a problem. Here is my take on how to begin, how often to collect meters, and some basic ideas on what to look for in the meters.
This article is excerpted (with modifications) from the following book by Bob Wescott (LinkedIn) – The Every Computer Performance Book : A practical and occasionally funny book on doing computer performance work that works on ANY collection of computers. It covers the things that are always true about performance metering, capacity planning, load testing, modeling and presenting performance results.
Begin Where You Are and Grow : To begin, just begin. Start with whatever metering is in place. Don’t wait for perfect meters or perfect understanding of the meters. There will always be some mystery in the meters. That’s OK. As Teddy Roosevelt once said: “Do what you can, with what you have, where you are.” Let the questions you face guide you as you learn to trust the meters you have and explore for new sources of information.
Collect Meters All the Time : To catch a performance problem you must have meters running when the problem happens. Some problems are expected (like load tests or seasonal peaks), but, if you want to solve the problems that pop up out of nowhere, you really need to have continuous monitoring.
So what meters do you collect for unexpected problems? The same ones you normally do. When any problem happens look for clues as to where the problem is AND clues as to where the problem is not.
When a problem happens, the problem could be literally anywhere in your computing world and thus the problem space encompasses your entire computing world. As you look at your meters, many of them will show that a given part of your computing world is working well. That happy news tells you where not to search for the source of the problem. Many new performance people only focus on the bad news, like young kids playing soccer focus only on the ball.
Paying attention to both the good and the bad news in the meters results in less wasted time and a more focused search for the root cause, because your metering data shrinks the problem space.
Collect Meters at the Right Frequency : You often have choices to make about how frequently you want to collect metering information. There is no “right” answer to this because it depends on what you are doing. The more frequently you meter, the higher the resolution is on your performance data (e.g. more pixels in the image) and the higher the cost of metering. For internal meters, the “cost” might be just a few CPU cycles and a little disk space. For external third party testing, the “cost” might be real money. Know the cost of your metering and find a good balance between cost and getting the data you need.
My rule of thumb is you need to meter at a frequency that will get you at least two or three good samples during the performance problem you are trying to study.
If you are studying long-term performance changes as the months and years go by, you can meter at a very leisurely frequency. If your problem happens for an hour a day, sample every 10 minutes. If you are trying to catch a problem that usually lasts for a minute, I’d examine the meters every 10-15 seconds. Even one sample is better than nothing, but multiple samples make the case more convincing.
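The rule of thumb above is simple division, and a minimal sketch makes it concrete. The function name sampling_interval and its parameters are my own illustration, not part of any metering tool. It treats each sample as an instant in time, so take the result as an upper bound on the interval rather than a precise answer.

```python
def sampling_interval(problem_duration_s: float, min_samples: int = 3) -> float:
    """Longest sampling interval (seconds) that still lands at least
    min_samples samples inside a problem of the given duration.

    Hypothetical helper for illustration only.
    """
    if problem_duration_s <= 0:
        raise ValueError("problem_duration_s must be positive")
    if min_samples < 1:
        raise ValueError("min_samples must be at least 1")
    return problem_duration_s / min_samples


# An hour-long daily problem sampled six times: one sample every 10 minutes.
hourly = sampling_interval(3600, min_samples=6)   # 600.0 seconds

# A one-minute problem sampled four times: one sample every 15 seconds.
brief = sampling_interval(60, min_samples=4)      # 15.0 seconds
```

Sampling more often than this is fine if the cost is acceptable. Sampling less often means you are counting on luck.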
If your meters are used as inputs to some sort of alert system or dashboard, then you need to test frequently enough so the alarm is raised in a timely manner and there are multiple indications this is a real problem. Especially when some part of the transaction path goes over the Internet, you want multiple confirmations of a problem because sometimes bad things happen on the Internet that are random, short term, and totally out of your control. If you are going to claim the sky is falling, you’d better have some pretty good evidence, as nothing erodes confidence in your work like raising a false alarm.
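One common way to get multiple confirmations is to raise the alarm only after several consecutive samples cross a threshold. Here is a minimal sketch of that idea. The class name ConfirmedAlarm, the threshold, and the sample values are all made up for illustration, and real alerting systems usually offer richer rules than this.

```python
from collections import deque


class ConfirmedAlarm:
    """Report an alarm only after `confirmations` consecutive bad samples.

    Hypothetical helper: a sample is "bad" when it exceeds `threshold`.
    """

    def __init__(self, threshold: float, confirmations: int = 3):
        self.threshold = threshold
        # Keep only the most recent pass/fail results.
        self.recent = deque(maxlen=confirmations)

    def observe(self, value: float) -> bool:
        """Record one sample; return True if the alarm should be raised."""
        self.recent.append(value > self.threshold)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and all(self.recent)


# Made-up response times in seconds. The single 5.0 is a one-off blip
# and does not trigger the alarm; three bad samples in a row do.
alarm = ConfirmedAlarm(threshold=2.0, confirmations=3)
results = [alarm.observe(v) for v in [0.5, 5.0, 0.6, 3.0, 3.5, 4.0]]
# results == [False, False, False, False, False, True]
```

The tradeoff is delay: with three confirmations, the alarm cannot fire until three sampling intervals into the problem, so the sampling frequency has to be high enough to keep that delay acceptable.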
Pay Attention To Sample Length : The frequency at which you choose to meter is also influenced by the sample length of the meter. Imagine a meter that reports the average utilization of resource X over the previous 60 seconds. If you run that meter once every five minutes, then you are only collecting 12 samples per hour and have no data on resource X during 48 minutes (80%) of that hour. That may be just what you want as the problem you are studying typically lasts several hours. However, if you are looking for a problem that typically lasts a minute, then you only have a one-in-five chance of finding it.
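The arithmetic in that example is worth checking for your own meters. This small sketch reproduces it. The function name coverage and its parameters are illustrative assumptions, and it assumes the sample windows never overlap.

```python
def coverage(sample_length_s: int, interval_s: int, period_s: int = 3600):
    """Return (samples, observed_seconds, unobserved_fraction) for a meter
    that averages over sample_length_s and runs once every interval_s.

    Hypothetical helper; assumes sample_length_s <= interval_s.
    """
    if sample_length_s > interval_s:
        raise ValueError("sample windows would overlap")
    samples = period_s // interval_s
    observed = samples * sample_length_s
    unobserved_fraction = (period_s - observed) / period_s
    return samples, observed, unobserved_fraction


# A 60-second sample collected once every five minutes, over one hour.
samples, observed, unobserved = coverage(sample_length_s=60, interval_s=300)
# samples == 12, observed == 720 seconds (12 minutes),
# and 48 of the 60 minutes (80%) go unobserved.
```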
The Line Is Not The Data : Misunderstanding sample length can lead to trouble when you read too much into charted data. Below I plotted an hour’s worth of data where I collected a meter with a one-minute sample length once every five minutes.
The dots on the line are the values the meter returned. They show the utilization of a resource averaged over a period of one minute. The line is just a line connecting the dots. In reality, we know nothing about what this resource was doing in the time between samples. However, our brain focuses on the line and believes the utilization smoothly changed from one dot to the next. Maybe it did, maybe it didn’t. We don’t know.
Reality is usually noisier, and more chaotic, than the pretty graphs we draw. The lines on those pretty graphs can fool us into thinking we know more than we do. When someone or some tool shows you a pretty graph be sure to understand what the sample frequency and sample length are if you really want to get something out of it.
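To see how much a chart can hide, here is a small sketch using synthetic data that I made up for the purpose. It builds an hour of per-second utilization with a two-minute burst to 100%, then samples it the same way as the chart described above: a one-minute average collected once every five minutes. The function name sampled_averages is hypothetical.

```python
def sampled_averages(per_second, interval_s=300, sample_length_s=60):
    """Average the last sample_length_s seconds at the end of every interval_s.

    Hypothetical helper: per_second is a list with one utilization value
    per second.
    """
    samples = []
    for end in range(interval_s, len(per_second) + 1, interval_s):
        window = per_second[end - sample_length_s:end]
        samples.append(sum(window) / len(window))
    return samples


# Synthetic data: a steady 20% for an hour...
utilization = [20.0] * 3600
# ...except for a two-minute burst to 100% that falls between two samples.
utilization[630:750] = [100.0] * 120

dots = sampled_averages(utilization)

# The resource really did hit 100%, but every dot on the chart says 20%.
true_peak = max(utilization)   # 100.0
charted_peak = max(dots)       # 20.0
```

A line drawn through those twelve dots would be perfectly flat, and nothing on the chart would hint that the resource was saturated for two minutes.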
Synchronize Your Meters : It can make your job simpler and your graphs look cleaner if the programs that collect the meters are synchronized to the top of the minute. This makes the data easier to combine and compare across systems.
Most metering programs or scripts are just a big loop of commands that gather metering data. At the bottom of that loop is usually something that waits until it is time to gather the next round of samples.
Let’s say you want these meters gathered once a minute. If you do the easy thing and just wait for 60 seconds, the meters drift in time because the meters themselves take time to run. At every iteration the meters would start a little bit later in the minute. What you need to do at the bottom of the loop is wait until the beginning of the next minute. Then your meters will stay in sync with the top of the minute.
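A minimal sketch of that loop follows. The names seconds_until_next_boundary, run_meters, and collect are placeholders of my own. It relies on the fact that Unix epoch time, as returned by time.time(), lands on a multiple of 60 at the top of each minute (ignoring leap seconds).

```python
import time


def seconds_until_next_boundary(now: float, period_s: int = 60) -> float:
    """Seconds to wait from `now` (epoch seconds) until the next multiple
    of period_s, e.g. the top of the next minute when period_s is 60."""
    return period_s - (now % period_s)


def run_meters(collect, iterations: int, period_s: int = 60) -> None:
    """Call collect() once per period, starting each round on the boundary.

    Hypothetical loop: `collect` is whatever function gathers your meters.
    """
    for _ in range(iterations):
        collect()
        # Do not sleep a fixed 60 seconds here: the time collect() took
        # would accumulate and the start times would drift. Sleep only
        # until the next boundary instead.
        time.sleep(seconds_until_next_boundary(time.time(), period_s))


# If the meters finish 5 seconds into a minute, wait 55 seconds, not 60.
wait = seconds_until_next_boundary(125.0)   # 55.0
```

If one round of metering ever takes longer than the period, this loop simply skips to the next boundary, which leaves a visible gap in the data rather than a silent drift.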
Confirm What You Are Metering : The only constant in this universe is change. Applications, operating systems, hardware, and networks can and do change on a regular basis. It’s easy to start the right meters on the wrong system. It’s easy to miss an upgrade or a configuration change.
Before any meter-gathering program settles down into its main metering loop, it should gather some basic data about where it is and what else is there. Gather things like:
- System name and network address
- System hardware: CPU, memory, disk, etc.
- Operating System release and configuration info
- List of processes running
- Application configuration info
Most of the time this data is ignored, but when weird things happen, or results suddenly stop making sense, this data can provide a valuable set of clues as to what changed.
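As a starting point, several of the items in that list can be captured with nothing but the Python standard library. This is a sketch, not a complete inventory: the function name startup_snapshot and the dictionary keys are my own choices. Memory size, the process list, and application configuration are left out because gathering them portably needs platform-specific commands or a third-party package.

```python
import os
import platform
import shutil
import socket
import time


def startup_snapshot() -> dict:
    """Record where the metering program is running before the main loop.

    Hypothetical helper; the keys are illustrative, not a standard format.
    """
    hostname = socket.gethostname()
    try:
        address = socket.gethostbyname(hostname)
    except OSError:
        address = None  # name did not resolve; record that rather than fail

    root = os.path.abspath(os.sep)  # "/" on Unix, the current drive on Windows
    return {
        "collected_at": time.strftime("%Y-%m-%dT%H:%M:%S%z"),
        "hostname": hostname,
        "address": address,
        "machine": platform.machine(),
        "cpu_count": os.cpu_count(),
        "disk_total_bytes": shutil.disk_usage(root).total,
        "os": platform.system(),
        "os_release": platform.release(),
        "os_version": platform.version(),
    }
```

Writing this snapshot as the first record in each metering log (for example with json.dumps) keeps it next to the data it describes, so it is there when results suddenly stop making sense.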
Knowing What Is Normal : On any given day you should have a sense of what is normal in your life. This is also true about your computing world. Before you look at your meters, you should have a good guess as to what they will show you. You develop this skill by first guesstimating what the meter will show and then looking at the results. Do that a couple of times every day and soon you will know what is normal. This is a wildly valuable skill because the unusual result will jump right out at you, and that is often a big clue as to what’s wrong.
In Closing : Start metering. Don’t wait for perfect meters. Keep your meters running all the time and adjust your meters periodically so they sample frequently enough to find the problems you are looking for. Know what is normal so the abnormal stands out clearly.
About the Author : Bob Wescott (LinkedIn) is semi-retired after a 30-year career in high tech that was mostly focused on computer performance work. Bob has done professional services work in the field of computer performance analysis, including capacity planning, load testing, simulation modeling, and web performance. He has even written a book on the subject: The Every Computer Performance Book.
This short, occasionally funny book covers Performance Monitoring, Capacity Planning, Load Testing, and Modeling. It works for any application running on any collection of computers you have. It teaches you how to discover more about your meters than the documentation reveals. It only requires the simplest math on your part, yet it allows you to easily use fairly advanced techniques. It is relentlessly practical, buzzword free, and written in a conversational style.
Bob’s fundamental skill is explaining complex things clearly. He has developed and joyfully taught customer courses at four computer companies and has been a featured speaker at large conferences. Bob’s goal is to be of service, explain things clearly, teach with joy, and lead an honorable life. At this stage of the game, he also wants to pass on what he has learned to the next generation.
As always, do send us an email with your input, comments, feedback, and suggestions. If you think you’ve got the talent, are keen on sharing your knowledge and experiences, and want to help us grow the community here at Practical Performance Analyst, please reach out to us over email.