One of the big advantages of working in the Performance Engineering space is that you get the opportunity, and the privilege, of working with breakthrough or bleeding-edge technologies and, at times, with application stacks that are quite obsolete and passé. As part of a Performance Engineer’s job you will have learned to deal with large, complex systems: you will have benchmarked them, broken them, tuned them, optimized them and scaled them to meet your customers’ Non-Functional Requirements or Performance Objectives. And all along the way, without realizing it, you will have generated enormous amounts of data.
What is Data Exhaust – Everything we do in life generates data. Social media pundits call the data trail we generate through our online social interactions “Data Exhaust”. A data exhaust, or data trail, is generated when you interact with your friends, colleagues, business partners or just about anyone online using social media applications and related technologies. To help you understand where I am coming from, let’s look at a day in the life of a “connected individual”, i.e. someone like yourself.
- You wake up in the morning, log into your Android or Apple device and check Facebook. You get the adrenalin rush of seeing half a dozen messages and you respond to all of them in a hurry.
- You like a particular conversation with your colleagues, including the pictures they’ve posted of you, and you hit the like button on a few of them.
- Later on, over breakfast, you log in to your Google mailbox using your iPad, clear the relevant messages and then dash off to work.
- On your way to work you hop off at your favorite coffee shop, order your favorite coffee, pay for it using your debit card and then use the coffee shop’s mobile application (on your phone) to scan the QR code. This is the equivalent of a stamp on your frequent coffee user card.
- At work you log into LinkedIn, open up your LinkedIn groups and connect to relevant topics of interest. There you interact with your colleagues and friends in conversations that interest you. You click the like button when you like a certain conversation and add your two cents to it.
- You navigate to your favourite movie review website, check out a few movie trailers, give the ones you like a thumbs up and the ones you dislike a thumbs down.
- It’s time for lunch; you want to check out which restaurants you could go to with your mates. You open up Google Maps on your phone and look for interesting places to eat.
- You get back home, use your Facebook credentials to log in to Netflix or something equivalent, open up your queue of movies, review the ones you like and move on to watching something else.
- You get fed up with Netflix, use your Facebook credentials to log in to Hulu, check out the latest episodes of your favourite soap opera and then head off to grab some dinner.
Each of these actions has left a significant amount of data on the servers of the respective application service providers. This is what Data Scientists and Social Media Pundits call a “Data Trail” or “Data Exhaust”. Organizations are beginning to use this data to learn more about their customers, understand their customers’ behavior in more detail and react to it in real time, with the objective of staying ahead of the competition. By now you may be wondering what relevance this so-called “Data Exhaust” has for us Performance Engineers, and why we should be interested in what it has to offer.
An opportunity or a nightmare – Now let’s look at how we as Performance Engineers have been affected by the lack of easy-to-use and affordable tools for data mining, data analysis and modelling. I personally think it’s fascinating that we now have access to all of this data and, most importantly, to the tools to analyze it. The interaction of our users with the applications we manage or host generates enormous amounts of data. In fact, we’ve always had this data at our disposal; however, we’ve faced numerous challenges trying to source the data, make sense of it and identify patterns that could help us manage our customers’ applications optimally while helping our customers run their businesses more efficiently. Here are some of the issues we as Performance Engineers have faced over the years trying to make sense of the data around us.
- Identifying relevant data to extract from a customer’s applications, i.e. the relevant business and infrastructure workload drivers
- Extracting relevant data from customer applications, i.e. web servers, application servers and databases
- Correlating the data sets obtained from different tiers with each other, including correlating sessions across the various applications
- Converting this “textual” data into a form optimized for long-term storage
- Finding affordable tools to manipulate relevant data for purposes of reporting, analysis, modelling and forecasting
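To make the extraction and correlation challenges above a little more concrete, here is a minimal sketch in Python. It is not a real tool or a definitive approach. The two log formats, the `session=` field and the names `WEB_LOG`, `APP_LOG`, `parse` and `time_per_session` are all assumptions invented for illustration; real web and application server logs will look different and rarely share a clean session identifier.

```python
import re
from collections import defaultdict

# Hypothetical log extracts. Both formats are invented for this illustration.
WEB_LOG = """\
2013-05-01T09:00:01 session=abc123 GET /cart 200 0.412
2013-05-01T09:00:02 session=def456 GET /home 200 0.120
2013-05-01T09:00:05 session=abc123 POST /checkout 200 1.950
"""

APP_LOG = """\
2013-05-01T09:00:01 session=abc123 CartService.load 0.300
2013-05-01T09:00:05 session=abc123 PaymentService.charge 1.700
2013-05-01T09:00:02 session=def456 HomeService.render 0.050
"""

# One regular expression per tier turns each text line into named fields.
WEB_PATTERN = re.compile(
    r"(?P<ts>\S+) session=(?P<session>\S+) (?P<method>\S+) (?P<url>\S+) "
    r"(?P<status>\d{3}) (?P<seconds>[\d.]+)"
)
APP_PATTERN = re.compile(
    r"(?P<ts>\S+) session=(?P<session>\S+) (?P<operation>\S+) (?P<seconds>[\d.]+)"
)


def parse(text, pattern):
    """Return one dict per line that matches the pattern; skip the rest."""
    records = []
    for line in text.splitlines():
        match = pattern.match(line)
        if match:
            records.append(match.groupdict())
    return records


def time_per_session(web_records, app_records):
    """Correlate the two tiers by summing their seconds per session id."""
    totals = defaultdict(lambda: {"web": 0.0, "app": 0.0})
    for rec in web_records:
        totals[rec["session"]]["web"] += float(rec["seconds"])
    for rec in app_records:
        totals[rec["session"]]["app"] += float(rec["seconds"])
    return dict(totals)


sessions = time_per_session(parse(WEB_LOG, WEB_PATTERN), parse(APP_LOG, APP_PATTERN))
for session_id, tiers in sorted(sessions.items()):
    print(f"{session_id}: web={tiers['web']:.3f}s app={tiers['app']:.3f}s")
```

Even in this toy form, the hard parts show up quickly: someone has to know which fields matter, each tier needs its own parser, and the join only works if every tier records the same session key.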
As Performance Engineers we are constantly faced with the challenge of trying to find the needle (performance or capacity bottlenecks) in a haystack (the customer application), and most of us have to do it on a daily basis. Making sense of the data and having the relevant tools to get the job done is key to understanding what is happening behind the scenes. To get a good idea of what your application performance looks like, to understand which customer actions trigger performance issues, to understand how those performance issues or user actions translate into capacity issues, and to help your customer manage application performance and platform capacity proactively, you have to understand data. Understanding the data generated by the systems is the first step towards finding the needle (performance or capacity bottleneck) within the haystack (the customer application).
Let’s look at the basics – So we all agree that understanding data and making sense of it is key to finding the needle in the haystack. Over the years, when attempting to identify and fix Performance or Capacity issues, I’ve had to struggle a lot to identify the relevant data sources, obtain data from them and then transform it into a form that could be fed into Excel or other tools that help with data visualization. Frankly, I’ve never been a great fan of Excel, and I would do everything possible to build an automated solution to get the job done. Now that we’ve come this far, I would like to introduce you to some of the important terms that will help you understand the science of data.
- Data Visualization – Data Visualization is the science of presenting data visually. In layman’s terms it is about viewing the data from different perspectives, slicing it, dicing it and viewing it using different visualization techniques, e.g. time series plots, scatter charts, bar graphs, etc. You can read more on Data Visualization at http://en.wikipedia.org/wiki/Data_visualization
- Data Science – Here’s what Wikipedia has to say about Data Science, “Data science incorporates varying elements and builds on techniques and theories from many fields, including math, statistics, data engineering, pattern recognition and learning, advanced computing, visualization, uncertainty modeling, data warehousing and high performance computing with the goal of extracting meaning from data and creating data products. Data science is a novel term that is often used interchangeably with competitive intelligence or business analytics, although it is becoming more common. Data science seeks to use all available and relevant data to effectively tell a story that can be easily understood by non-practitioners.” You can read more about Data Science at – http://en.wikipedia.org/wiki/Data_science
- Data Scientist – Here’s one of the definitions by IBM that comes close to how I would define a Data Scientist, “A data scientist represents an evolution from the business or data analyst role. The formal training is similar, with a solid foundation typically in computer science and applications, modelling, statistics, analytics and math. What sets the data scientist apart is strong business acumen, coupled with the ability to communicate findings to both business and IT leaders in a way that can influence how an organization approaches a business challenge. Good data scientists will not just address business problems; they will pick the right problems that have the most value to the organization.” You can read more about IBM’s definition of the Data Scientist, including IBM’s vision and the tools they’ve got to offer, at – http://www-01.ibm.com/software/data/infosphere/data-scientist/
- Modelling – Modelling is the art and science of using various algorithms or techniques to identify patterns within one or more sets of data. These models can be built using analytical, statistical or simulation modelling techniques and used to proactively forecast and manage application performance, including platform capacity.
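As a small illustration of the statistical flavour of modelling described above, the sketch below fits a straight line (ordinary least squares) to CPU utilization against throughput and uses it for a rough capacity forecast. The sample measurements and the names `samples`, `fit_line`, `predict_cpu` and `max_throughput` are hypothetical, and the linear relationship is itself an assumption: it tends to be reasonable only well below saturation, so treat this as a starting point rather than a recommended method.

```python
# Hypothetical load test measurements: (requests per second, CPU utilization %).
samples = [(50, 12.0), (100, 22.0), (150, 32.0), (200, 42.0)]


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(samples)


def predict_cpu(requests_per_second):
    """Forecast CPU utilization (%) at a given throughput."""
    return slope * requests_per_second + intercept


def max_throughput(cpu_limit_percent):
    """Throughput at which the fitted line reaches the given CPU limit."""
    return (cpu_limit_percent - intercept) / slope


print(f"CPU per request/s: {slope:.3f}%, idle baseline: {intercept:.1f}%")
print(f"Forecast CPU at 300 req/s: {predict_cpu(300):.1f}%")
print(f"Throughput at an 80% CPU ceiling: {max_throughput(80):.0f} req/s")
```

The same idea extends to other workload drivers (orders per hour, concurrent sessions) and to other resources, which is where richer analytical or simulation models start to earn their keep.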
So why am I writing all of this – There are quite a few reasons I have taken off on a data-related tangent with a Performance Engineer’s mindset.
- Performance Engineering is a really poorly understood discipline worldwide
- Most of the tools we as Performance Engineers have been brought up on are really expensive
- There is a dearth of affordable tools for most tasks we are asked to do
- I claim that Data Science is an integral part of Performance Engineering
- The Performance Engineer should build the skills and capability to venture into Data Science
As Performance Engineers we are constantly faced with challenges of trying to find the needle (performance or capacity bottlenecks) in a haystack (the customer application). Making sense of the data while having access to the relevant & affordable tools to get the job done is key to understanding what is happening behind the scenes. To help your customer manage application performance and platform capacity proactively you have to understand the “Data Exhaust”. Understanding the data generated by the systems is the first step towards finding the needle (performance or capacity bottleneck) within the haystack (the customer application).
I intend to write a series of articles covering each of these challenges in a lot more detail. We will be on a quest for affordable, easy-to-use tools, and on a quest to make “Data Science” easy to digest for the Performance Engineer. I will also be creating a few additional sections at Practical Performance Analyst to cover Data Science for the Performance Engineer. This, like most other things, will evolve over time. If you’ve got the inclination to learn, share and grow the community, please drop me a line at trevor at practical performance analyst dot com.
Reach out to us with your ideas, thoughts and input on making this a useful series.