How it all started – It all started with a conversation a day ago with a colleague about the need to monitor system performance for a complex multi-tiered financial application. The colleague wanted to know what could be done to help the customer manage his application performance proactively in production using one of the commercial diagnostics tools. Apparently the vendor had sold the client the concept of using the diagnostics tool to manage performance in Pre-production (Stress & Volume Test) and Production Environments. Now, don’t get me wrong. I am not against using diagnostics and profiling tools to track and monitor performance issues in production.
One should, however, be careful about the type of diagnostics and profiling tools embedded into production environments, because of the overheads some of these tools impose (contrary to the claims of some vendors) and the licensing implications they bring to the table. Above all, while diagnostics tools are good for getting an insight into what’s happening within an application container, that’s only part of the information you really need to manage performance from an end to end standpoint, especially when dealing with Web 2.0 media rich applications. Monitoring is holistic and needs to be looked at from an end to end perspective, keeping in mind that customers access information over various different network topologies.
This got me thinking about the challenges I’ve faced over the years at customer locations worldwide trying to architect a single unified view of performance. However, before I go on I’ll spend a few minutes articulating my Utopic view of holistic System Performance.
Utopic View of System Performance – A utopic view of system performance gives the Performance Engineer, the CTO, the Operations Manager, the Database Architect, the System Support Engineer, etc. the relevant performance information they would like to see for their specific set of applications and the relevant application infrastructure. A Utopic view of system performance consists of a set of integrated enterprise wide dashboards that allow for proactive Service Level Management with a focus on Business & IT Service Level Management. These Service Level Management dashboards allow IT and Business to track and measure performance on metrics that make sense to them. The dashboards also provide drill down views into the various underlying system components, for a better understanding of the performance of the physical or virtual infrastructure the application is hosted on. Let’s look at this in a bit more detail:
- System Monitoring – I would like to define System Monitoring as the enterprise wide approach to monitoring system infrastructure (i.e. networks, operating systems, applications, business transactions, etc.) in order to proactively identify performance issues across the stack. System monitoring is a holistic approach that ensures IT within the enterprise is collecting the relevant set of metrics across the stack, for all the relevant systems, at the right levels of granularity. This allows events to be correlated across the stack, with the objective of identifying performance issues before they turn into actual show stoppers or SLA breaches. Let’s go one level further down into System monitoring and look at the relevant components.
- Network Monitoring – Network monitoring as the name suggests has pure focus on collecting performance metrics for relevant networking components (i.e. routers, switches, load balancers, etc.) and pipes across the enterprise.
- Infrastructure Monitoring – Infrastructure monitoring is focused on collecting relevant performance metrics for the physical and virtual infrastructure and the operating systems hosted on top of it.
- Application Monitoring – Application monitoring is focused on collecting the key application performance metrics across the stack. This would include the web servers, application servers, integration servers, batch servers, databases, etc.
- Business Transaction Monitoring – Business Transaction monitoring consists of pulling together transactional performance information for your business critical customer transactions. There are various ways of doing this (e.g. End User Monitoring or Real User Monitoring) and we won’t go into the details at this point in time.
- Application Diagnostics – Application Diagnostics allows for monitoring the performance of the application container while maintaining really low performance overheads. Application Diagnostics tools enable the operations team to identify potential areas of concern across the various application tiers by watching the code execution paths of a live application in production.
As an Operations Manager or a Business IT Lead it’s essential that you have a rolled up view of System Performance to help you understand how well you’ve managed to meet your SLAs. A Utopic view of System Monitoring is essential for IT to be able to proactively manage performance across the stack and correlate performance issues across the various relevant tiers.
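To make the idea of correlating issues across tiers a little more concrete, here is a minimal sketch in Python. It assumes that events from the different monitoring layers have already been brought into one place with a timestamp and a tier label. The `Event` type, the `correlate` function, the 60 second window and the sample events are all hypothetical and are not taken from any particular monitoring product.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since the epoch
    tier: str         # e.g. "network", "infrastructure", "application"
    message: str


def correlate(events: List[Event], window_seconds: float = 60.0) -> List[List[Event]]:
    """Group events that occur close together in time and span more than one tier."""
    groups: List[List[Event]] = []
    current: List[Event] = []
    for event in sorted(events, key=lambda e: e.timestamp):
        # Start a new group when the gap to the previous event exceeds the window.
        if current and event.timestamp - current[-1].timestamp > window_seconds:
            groups.append(current)
            current = []
        current.append(event)
    if current:
        groups.append(current)
    # Keep only the groups that touch at least two tiers.
    return [g for g in groups if len({e.tier for e in g}) > 1]


# Hypothetical sample data, for illustration only.
events = [
    Event(0.0, "network", "packet loss on core switch"),
    Event(30.0, "application", "checkout response time rising"),
    Event(500.0, "database", "slow query logged"),
]
incidents = correlate(events)
for incident in incidents:
    print([f"{e.tier}: {e.message}" for e in incident])
```

Grouping by time alone is a deliberately crude choice. A real deployment would probably also want to use topology (which application runs on which host, behind which load balancer) to decide which events belong together.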
Bringing on old baggage – Monitoring is an interesting space and has always been close to my heart. Performance of computer systems is no different from performance of any other system, technical or non-technical. Anyone who understands the basics of system performance will tell you that tracking performance is an integral part of improving it. You need to know how your system (application and infrastructure) is performing before you start looking for areas of improvement.
Early on in my career as a Performance Engineer I was taught the following basic principles:
- If you don’t measure your performance you can’t manage it
- If you don’t measure your performance you can’t improve it
- If you don’t measure your performance you probably don’t care
Experience tells me that enterprises with truly integrated monitoring tools and dashboards are a rarity rather than the norm. Based on my experience, here is how enterprises usually choose to deploy their monitoring assets:
- Reactive performance monitoring, not proactive performance monitoring – It’s quite rare that I bump into someone who understands what a proactive approach to system monitoring is all about. Most enterprises use monitoring tools to set up alerts across the stack that tell them when something has fallen over. It takes a lot more effort to articulate an approach where a series of alerts and thresholds is set to help identify potential breaches before they actually occur.
- Monitoring tools running in silos – Most organizations have amassed their collection of monitoring tools over a period of time and have based their investments on the need of the hour rather than on a holistic approach to their system monitoring requirements. While that’s perfectly understandable for a large enterprise, one should look at rationalizing investments in the monitoring space to ensure that the data being collected is relevant, actionable and can be correlated to identify performance issues across the stack.
- Lack of correlation of data across the various tiers – We use monitoring tools to generate alerts and events across the stack. These alerts and events are meant to provide insight into performance issues and potential Service Level breaches across the stack. With data being collected in silos across different tools, the ability to correlate information is lost. All the expensive tools do their jobs wonderfully well, but getting them to speak to each other to provide an integrated view of performance is a challenge that few organizations have addressed really well.
- Inability to view detailed historical data – Monitoring data is actionable and usable when collected at the right intervals with the appropriate level of granularity. Unfortunately, most organizations deploy monitoring tools to collect data for a large number of performance metrics without paying much heed to the frequency at which data is collected or how the data is stored and rolled up, leaving the data pretty much useless for any analysis, forecasting, performance modelling or capacity management.
- Complex, expensive and monolithic – Tools that provide enterprise wide capability to monitor system performance proactively across the stack, with the ability to roll up performance at a Service Level, have traditionally been really expensive. This has changed a great deal in the last 5-6 years with numerous smaller organizations offering integrated system monitoring solutions. Traditional monitoring tools have been very expensive to license, deploy, manage and maintain. They’ve also lacked the ability to integrate easily with other 3rd party tools, e.g. HP BAC (Application Monitoring) integration with IBM Netcool (Network Monitoring).
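As an illustration of the difference between reactive and proactive alerting described in the list above, here is a minimal Python sketch. Instead of firing only once a threshold has been crossed, it fits a straight line through recent samples and estimates how long it will be before the threshold is reached. The function name `minutes_until_breach`, the disk utilisation figures and the 90% threshold are assumptions made for this example, and a simple linear trend is only one of many ways to project a metric forward.

```python
from typing import Optional, Sequence


def minutes_until_breach(samples: Sequence[float], threshold: float,
                         interval_minutes: float = 1.0) -> Optional[float]:
    """Estimate the minutes left before `threshold` is reached.

    Fits a least-squares line through equally spaced samples. Returns None
    when there is too little data or the trend is flat or falling, and 0.0
    when the latest sample is already at or above the threshold.
    """
    n = len(samples)
    if n < 2:
        return None
    if samples[-1] >= threshold:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    slope_num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    slope_den = sum((i - mean_x) ** 2 for i in range(n))
    slope = slope_num / slope_den  # change in the metric per sample interval
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    # Intervals between the latest sample and the point where the line hits the threshold.
    steps = (threshold - intercept) / slope - (n - 1)
    return max(steps, 0.0) * interval_minutes


# Hypothetical disk utilisation (%) sampled every 5 minutes.
disk_pct = [50.0, 55.0, 60.0, 65.0, 70.0]
remaining = minutes_until_breach(disk_pct, threshold=90.0, interval_minutes=5.0)
if remaining is not None and remaining <= 30.0:
    print(f"Warning: projected to reach 90% in about {remaining:.0f} minutes")
```

A reactive alert on the same data would stay silent until the 90% mark was actually hit. The projection gives the operations team a window in which to act, at the cost of some false alarms when the trend does not continue.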
Pragmatic way forward – So we all agree that no enterprise is perfect and we are all working in organizations which have acquired monitoring tools and assets over a period of decades. The question is how you leverage your investments and make good use of what you’ve got:
- Rationalize your investments – Organizations are like houses, they collect junk over a period of time. If it’s time for a garage sale, make it happen. Open up your doors for a critical review of your system monitoring infrastructure, bring in some external guns to review what you’ve got and rationalize investments across the board.
- Leverage SaaS where possible – SaaS options have increased drastically over the last year or two. They still have a long way to go and don’t yet offer the entire suite of integrated monitoring capability a large organization would need but might fit the bill for smaller enterprises with not so complex IT landscapes. This is definitely a space to be watching.
- Look for ease of integration and integration APIs – Look for integration with the other tools and vendors within the enterprise. Understand how easy or difficult it is to extract data, massage it, transform it and integrate it with your main Service Level Management dashboards. Invest in solutions that provide your organization the flexibility and integration you desire. Better still, perform a Proof Of Concept and try out the integration with your existing investments. Don’t take the sales pitch as gospel truth.
- Choose vendors carefully, bigger doesn’t necessarily mean better – A lot of the smaller vendors out there have come up with strong offerings that play well with third party solutions, e.g. AppDynamics, New Relic, AppNeta, etc. You’ll generally find that to build a holistic enterprise wide monitoring approach you’ll need to stitch together offerings from multiple vendors, so a better approach is to stick to vendors who have good 3rd party API support and a strong product offering in your particular area of need.
- Invest for the long term – Monitoring is a long term game. Speak to your business IT leads, understand their pains, understand what it is they would like to see, design an overall approach and get a feel for the investments they are willing to make to fix the gaps. Invest in solutions that are flexible and offer the scalability and integration that you would need.
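The "extract, massage, transform and integrate" step mentioned in the list above can be sketched as a set of small adapters that map each tool’s export format onto one common record. The Python below is only an illustration. `tool_a` and `tool_b`, their field names and the `Metric` record are all hypothetical and do not describe the export format of any real product.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Metric:
    source: str       # which monitoring tool the reading came from
    host: str
    name: str
    value: float
    timestamp: float  # seconds since the epoch


def from_tool_a(row: Dict[str, Any]) -> Metric:
    # Hypothetical tool A: string values and epoch milliseconds.
    return Metric("tool_a", row["hostname"], row["metric"],
                  float(row["val"]), row["ts_ms"] / 1000.0)


def from_tool_b(row: Dict[str, Any]) -> Metric:
    # Hypothetical tool B: numeric values and epoch seconds.
    return Metric("tool_b", row["node"], row["counter"],
                  float(row["reading"]), float(row["time"]))


ADAPTERS: Dict[str, Callable[[Dict[str, Any]], Metric]] = {
    "tool_a": from_tool_a,
    "tool_b": from_tool_b,
}


def normalize(source: str, rows: List[Dict[str, Any]]) -> List[Metric]:
    """Convert one tool's raw rows into the common Metric record."""
    adapter = ADAPTERS[source]
    return [adapter(row) for row in rows]


# Hypothetical exports from two different tools describing the same host.
unified = (
    normalize("tool_a", [{"hostname": "web01", "metric": "cpu_pct",
                          "val": "71.5", "ts_ms": 1700000000000}])
    + normalize("tool_b", [{"node": "web01", "counter": "net_errors",
                            "reading": 3, "time": 1700000012}])
)
for metric in unified:
    print(metric)
```

How much work each adapter takes is a useful thing to measure during a Proof Of Concept: if getting data out of a tool and into this kind of common shape is painful, the integration story is probably weaker than the sales pitch suggests.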
Closing notes – Holistic monitoring as defined in the Utopic view above is not impossible to achieve. It requires a combination of common sense, pragmatism, a longer term strategic view for the enterprise and, above all, lots of determination. As always, please write to us with your thoughts, comments, and input at trevor at practical performance analyst dot com. Also, please share your comments with the rest of the community using the comment boxes below.