Dr. Leonid Grinshpan (LinkedIn) is currently Technical Director at Oracle, focusing on enterprise applications capacity planning, load testing, modelling, performance analysis, and tuning. Leonid has a few decades of experience across two complementary areas: computer science and information technology (IT) engineering. He holds a Ph.D. in computer science and is also the author of a book on mathematical modelling.
In this series of articles Dr. Leonid Grinshpan presents an approach that Performance Architects and Performance Engineers should consider when tackling Performance Engineering activities across the SDLC. He has decades of experience building and delivering applications that perform, and through this series he sums up lessons that might save you from making the same mistakes when addressing systems performance across the SDLC.
You can read the first post in this series by clicking here – Building Systems That Perform : Part 1
Large System Implementation Challenges – Let’s assume you are in the midst of a large system implementation and, as tends to happen all too often, late in the program, during the performance testing phase, you discover that you have massive performance issues to deal with. The System Under Test (SUT) in this case is a large, complex maze of technologies: application servers, databases, messaging systems, caching systems, and third-party web services. As the Performance Engineering lead you are tasked with identifying the performance issues across the system. You are also responsible for raising the relevant defects, working with the development and build teams, identifying the root cause, and supporting development of a fix.
The SUT has a complex workload consisting of business transactions executed by users who access the system from different time zones. The usage profile peaks at different times of the day, depending on the active time zone and on scheduled monthly activity that needs to be performed using the application. The complexity of the system, combined with a fluctuating workload, makes identifying and addressing bottlenecks extremely cumbersome and time consuming. Pinpointing the relevant bottleneck components and associated tuning parameters across a delivery chain spanning numerous internal and external systems can be equated to locating a needle in a large and messy haystack.
Driving blind – System-level tuning parameters for any platform are large in number. Take tuning of a Microsoft or AIX stack, for example. Microsoft offers 112 pages of documentation titled “Performance Tuning Guidelines for Windows Server 2008 R2” (http://tinyurl.com/qx4v4gy). IBM’s AIX operating system tuning guide is even larger, at 744 pages (http://tinyurl.com/o3b66o8). The situation is similar with any other operating system or application container you might find.
Application tuning parameters control the application’s demand for system resources as well as the configuration of internal logical objects such as software threads, connection pools, etc. Application vendors publish comprehensive tuning documentation to help optimize their products. Here are a few examples of Oracle’s all-inclusive performance tuning publications: “Oracle® Fusion Middleware Performance and Tuning Guide” (http://tinyurl.com/kurmd9p), “Oracle® JRockit Performance Tuning Guide” (http://tinyurl.com/mggv55j), and “Oracle® Fusion Middleware Performance and Tuning for Oracle WebLogic Server” (http://tinyurl.com/panje7g).
So, given the complexity of the application stack and the fact that vendor tuning documentation can take a lifetime to read, is it even practically feasible to nail down a performance issue, let alone identify a fix for it? In other words, can we conceptualize an Enterprise Architecture (EA) view that abstracts away irrelevant details and concentrates only on the objects that have the potential to create bottlenecks? Can looking at less be the answer?
Looking at the bigger picture – The Enterprise Architecture complexity reduction process starts with building a mental model of the application. Wikipedia defines a mental model as an explanation of someone’s thought process about how something works in the real world (http://en.wikipedia.org/wiki/Mental_model). In this post we show how to build a mental model of an EA that exposes the relations between demand for EA services and supply of EA resources.
By devising a mental model we can view the system from different perspectives. Each perspective helps uncover application components, their interconnections, and transaction processing inside the EA.
Mental model constructs – We need three constructs to build a mental model that serves our purpose:
- Nodes – represent hardware components processing user requests. The nodes symbolize servers, appliances, and networks.
- Interconnections among nodes – they stand for connections among hardware components. Interconnections and nodes define EA topology.
- Transactions – characterize user requests for EA services. If we visualize a transaction as a physical object (for example, a car), we can create in our mind an image of a car-transaction visiting different nodes and spending some time in each one receiving a service.
The model’s constructs map to application objects as shown in the table below:
An application can be represented by a mental model as shown in the picture below.
A transaction starts its journey when a user clicks on a menu item or a link, implicitly initiating the transaction. In the model this means the transaction leaves the node “users”. It is then processed at the nodes “network” and “server”. At the end of its journey the transaction returns to the node “users”. The total time a transaction has spent in the nodes “network” and “server” is the transaction response time.
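The response-time arithmetic above can be sketched in a few lines of code. This is a minimal illustration of the mental model only; the node names and per-visit service times are assumptions made up for the example, not measurements:

```python
# Mental-model sketch: a transaction visits nodes, and its response time
# is the sum of the time it spends in each node along its journey.
# The service times below are illustrative assumptions (seconds per visit).
NODE_SERVICE_TIME = {"network": 0.05, "server": 0.20}

def response_time(journey):
    """Total time a transaction spends in the nodes it visits."""
    return sum(NODE_SERVICE_TIME[node] for node in journey)

# The transaction leaves "users", crosses the network with the request,
# is processed at the server, and crosses the network again with the reply.
journey = ["network", "server", "network"]
print(round(response_time(journey), 2))  # 0.3
```

Adding a node (say, a database behind the server) to the journey list immediately shows up in the total, which is exactly how the mental model lets you reason about where time goes.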
How mental models help to identify bottlenecks – Let’s consider what can delay the processing of a car-transaction in a node. One obvious reason: the node does not have enough capacity when a number of car-transactions line up concurrently at the service center. In that case some car-transactions receive service while the others wait in a queue until the service center frees up the resources required to process them. Another, less obvious, reason for delay is this: in order to be processed on the CPU, a transaction needs the application to request and receive a particular memory space. What happens if memory is not available? The transaction again queues for resources at the service center. In general, this means that transaction processing can be delayed by limited access to a node even when the node is not fully utilized.
We have come to an important conclusion: transaction delay can be caused by two circumstances – shortage of resource capacity and limited access to a resource.
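Both circumstances produce the same symptom, queueing, which a small deterministic sketch can make concrete. In the hypothetical function below, `slots` stands in for whatever grants service: CPUs in the capacity-shortage case, or an access-granting object such as a connection pool in the limited-access case. All numbers are illustrative assumptions:

```python
import heapq

def fifo_waits(arrival_times, service_time, slots):
    """Waiting time of each transaction at a node where at most `slots`
    transactions can be in service at once (FIFO order)."""
    free_at = [0.0] * slots          # times at which each slot frees up
    heapq.heapify(free_at)
    waits = []
    for t in arrival_times:
        earliest = heapq.heappop(free_at)   # next slot to become free
        start = max(t, earliest)
        waits.append(start - t)
        heapq.heappush(free_at, start + service_time)
    return waits

arrivals = [0.0, 0.0, 0.0, 0.0]   # four transactions arrive together
# One connection-pool slot: transactions queue even if CPUs sit idle.
print(fifo_waits(arrivals, 1.0, slots=1))  # [0.0, 1.0, 2.0, 3.0]
# Four slots: the same workload sees no waiting at all.
print(fifo_waits(arrivals, 1.0, slots=4))  # [0.0, 0.0, 0.0, 0.0]
```

The point of the sketch is that insufficient capacity and a too-small access object look identical from the transaction’s perspective: time spent waiting in line.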
Identifying bottlenecks – In order to identify where bottlenecks can potentially take place, we have to monitor all hardware resources that process transactions, as well as all objects providing access to those resources. Such objects include physical ones (like memory and disk space) as well as logical programmatic constructs (like software threads, connection pools, locks, semaphores, etc.).
The model in the picture above suggests that bottlenecks in our application could surface when the hardware server has insufficient CPU or I/O resources, when the server has limited memory, or when the application spawns too few software threads or features a poorly tuned connection pool at the database layer. Low network throughput can, of course, also cause bottlenecks.
The model we have drawn has focused our bottleneck troubleshooting efforts in the right direction. We are now in a better position to deploy relevant monitors to collect the server’s CPU and I/O performance counters. In addition, we can monitor server memory availability and the behaviour of the connection pools and software threads, and use network monitors to assess network latency and throughput.
The bottom line – mental models expose application fundamentals, distilled from the innumerable application particulars that conceal the roots of performance issues.
From mental models to queuing models – EA mental models are indispensable instruments for streamlining our performance troubleshooting activities. They point to the fact that a bottleneck occurs when a node does not have sufficient capacity or when access to the node is limited. In both cases the processing of a transaction is delayed because the transaction is placed into a waiting queue.
Queuing is the major phenomenon defining EA performance, but mental models cannot quantitatively assess its impact on transaction times or on EA architecture. If we want to find an EA architecture that delivers performance according to a Service Level Agreement, we have to transition from EA mental models to EA queuing models. The book I wrote some time ago, “Solving Enterprise Applications Performance Puzzles: Queuing Models to the Rescue” (Wiley-IEEE Press, 1st edition, 2012), can help answer the questions that cannot be fully understood or explained with a mental model alone.
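As a taste of what a queuing model adds, the textbook single-server M/M/1 formula R = S / (1 − U), with utilization U = λ·S, quantifies how queueing inflates response time. This formula is standard queuing theory, not taken from the book above, and the numbers are illustrative:

```python
def mm1_response_time(arrival_rate, service_time):
    """Mean response time of an M/M/1 node: R = S / (1 - U), U = lambda * S."""
    utilization = arrival_rate * service_time
    if utilization >= 1.0:
        raise ValueError("node saturated: the queue grows without bound")
    return service_time / (1.0 - utilization)

# 2 transactions/s at 0.25 s of service keep the node only 50% busy,
# yet queueing already doubles the response time from 0.25 s to 0.5 s.
print(mm1_response_time(2.0, 0.25))  # 0.5
```

This is exactly the quantitative step a mental model cannot take: it can tell you *where* a queue may form, while the queuing model tells you *how much* that queue will cost against an SLA target.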
Leonid has worked on over 200 capacity planning and performance tuning projects for Fortune 500 customers over the past eleven years. He is also the recipient of the highest award in the USSR for excellence in IT engineering – the Award of the Council of Ministers of the USSR for the design and implementation of an open CAD system for microprocessor-based products. He tested his entrepreneurial skills by co-founding the first Belarusian-American joint venture, Software Security Belarus (acquired in 1997 by the US company Rainbow Technologies).