Tuesday, May 27, 2014

Performance Analysis - Part 4

When it comes to bottleneck isolation, first suspect the configuration/settings of the web server, app server and database server. The top items you must check, before getting into the code, are:
  1. Web server max connections
  2. JVM max heap memory settings (see the sketch below)
  3. Web server cache limit
  4. Database indexes for the most frequently used queries
  5. Database max connections
  6. Swap file settings of the machine
This is not the complete list, but these are key items that most of us forget to check. Once these are resolved, we need to get into the application code. A performance tester may not know all the details of the code - no worries. There are tools that can get into the code and tell you where the problem is.
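
As a quick illustration of item 2, the JVM itself can tell us what heap limit it is actually running with - useful when the startup scripts say one thing and reality says another. This is only a minimal sketch; the 80% warning threshold is my own example, not a standard value.

```java
// A minimal sketch for verifying the JVM heap settings the app server is
// actually running with (item 2 above). Run it inside the same JVM, or read
// the same MemoryMXBean values over JMX. The 80% threshold is an assumption.
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapSettingsCheck {
    public static void main(String[] args) {
        MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memoryBean.getHeapMemoryUsage();

        System.out.printf("Heap init : %d MB%n", heap.getInit() / (1024 * 1024));
        System.out.printf("Heap max  : %d MB%n", heap.getMax() / (1024 * 1024));
        System.out.printf("Heap used : %d MB%n", heap.getUsed() / (1024 * 1024));

        double usedPct = 100.0 * heap.getUsed() / heap.getMax();
        if (usedPct > 80.0) {
            System.out.println("Heap already above 80% - revisit -Xmx before blaming the code.");
        }
    }
}
```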

These tools are called APM (application performance monitoring) tools. They come with an agent that we must install on our servers. The agents send fine-grained details to the tool's central database, which then shows the problem areas.

Look at the following image.

This is taken from a profiling tool. These tools run on the servers and collect deep-dive data, whereas load testing tools give stats only at the page level. If you look at this image, it has JSP details as well as details for LoginAction, LookupAction, TaxAction etc. An Action is Java code that runs on the server side. While the load test is going on, the profiler measures the time taken by these server-side functions and provides the details.

From this, we can see that main.jsp and LoginAction are the most time-consuming ones compared to the others. So we must dig deep into those and find the issues. Drilling down into a top-level function is known as a transaction trace.

Look at this transaction trace.

This trace clearly tells us what calls are made within LoginAction, in the actual call sequence. From this we know how many Java calls and how many database calls are made. The number of times each call is made and the time taken by each call are clearly provided.
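
APM agents build such traces automatically by instrumenting the code. As a rough mental model only (not how any particular tool works internally), the sketch below hand-rolls the same idea: nested timed sections with names, call order and elapsed times. The class, method and query names are made up for illustration.

```java
// A minimal sketch of what a transaction trace records: nested timed spans.
// LoginAction, AuthService.authenticate and the SELECT query are hypothetical.
import java.util.ArrayList;
import java.util.List;

public class TraceSketch {

    static class Span {
        final String name;
        final long startNanos = System.nanoTime();
        long elapsedNanos;
        final List<Span> children = new ArrayList<>();

        Span(String name) { this.name = name; }

        Span child(String childName) {          // a nested call inside this one
            Span c = new Span(childName);
            children.add(c);
            return c;
        }

        void end() { elapsedNanos = System.nanoTime() - startNanos; }

        void print(String indent) {             // print the call tree with timings
            System.out.printf("%s%-40s %6.1f ms%n", indent, name, elapsedNanos / 1_000_000.0);
            for (Span c : children) c.print(indent + "  ");
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Span root = new Span("LoginAction.execute");          // top-level transaction
        Span auth = root.child("AuthService.authenticate");   // nested Java call
        Thread.sleep(40);                                      // simulated work
        Span sql = auth.child("JDBC: SELECT * FROM users");    // nested database call
        Thread.sleep(120);                                     // simulated slow query
        sql.end();
        auth.end();
        root.end();
        root.print("");
    }
}
```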

Come on, what else do you need to identify the bottleneck? Go through the transactions one by one, go through each trace and find the function calls or database calls that take the maximum time. Ask your dev team to work on those functions for better performance.

How do you find this? If you say this is not interesting, your emotion levels need to be checked!
 

Monday, March 31, 2014

Performance Testing - Analysis - Part 3

One may get chest pain as a symptom; the immediate cause may be a blocked blood vessel, but the root cause may be a high cholesterol level. Performance analysis must aim for the root cause. Remember: performance analysis does not depend on just one parameter. We need to start questioning every single number that goes outside our limits. In the last part, we had narrowed things down to a point - degradation started at around the 25th or 30th minute, when 50-odd users were in the system. Slowness is definitely related to some system resource, directly or indirectly. Some part of the code may use more CPU, memory, network or disk, and that will clog the entire system.

Let us examine this graph.

You can clearly see that by the 25th minute, one server's memory usage has gone beyond 95% of total capacity. From that point until the end, it stayed that way. Now we know that the shortage of memory in that server caused the slowness from the 25th minute onwards. So, any program deployed on that server must be looked into. During load testing, not all programs are running; we know which beans/servlets etc. are used by the scenario we tested, so we must start looking at those immediately.
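
If the suspect server is the Java app server, its heap can be watched over the run in the same spirit as the graph above. A minimal sketch, assuming we can run this (or read the same MXBean over JMX) on that server; the 10-second interval and the 95% threshold simply mirror the numbers discussed in these posts.

```java
// A minimal sketch for logging JVM heap usage percentage over a test run.
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapWatcher {
    public static void main(String[] args) throws InterruptedException {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        while (true) {
            MemoryUsage heap = memory.getHeapMemoryUsage();
            // getMax() can be -1 if the JVM reports no defined limit
            double usedPct = 100.0 * heap.getUsed() / heap.getMax();
            System.out.printf("%tT heap used: %.1f%%%n", System.currentTimeMillis(), usedPct);
            if (usedPct > 95.0) {
                System.out.println("WARNING: heap above 95% - correlate with the response time graph");
            }
            Thread.sleep(10_000);   // sample every 10 seconds
        }
    }
}
```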

Look at this graph.


Processes runnable per CPU, in Linux, is the run queue length. It means that many processes were waiting to grab the CPU on that machine. The longer the queue, the longer the wait time - we all know that. This graph also indicates that from around the 25th minute onwards, the queue length has gone beyond the threshold.
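
On Linux, the same number can be sampled directly: /proc/stat exposes procs_running, the count of tasks currently runnable. A minimal sketch (Linux only), with the per-CPU division done the same way as in the graph.

```java
// A minimal sketch (Linux only) for sampling the run queue: /proc/stat's
// "procs_running" line gives the number of currently runnable tasks.
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class RunQueueSampler {
    public static void main(String[] args) throws Exception {
        int cpus = Runtime.getRuntime().availableProcessors();
        List<String> lines = Files.readAllLines(Paths.get("/proc/stat"));
        for (String line : lines) {
            if (line.startsWith("procs_running")) {
                int runnable = Integer.parseInt(line.split("\\s+")[1]);
                System.out.printf("runnable=%d, per cpu=%.2f%n", runnable, (double) runnable / cpus);
            }
        }
    }
}
```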

Both graphs clearly indicate that consumption gradually increases. It does not get released after use, so it just piles up. At some point the system gets exhausted, and then we see issues cropping up.

So now we know it is memory consumption that caused the issues. Who exactly consumed the memory? Good question. We will see that in the next post.

Tuesday, March 11, 2014

Performance Testing - Analysis - Part 2

Performance bottleneck analysis has a lot of similarities with crime investigation! We need to start with the messiest area and then investigate things around it. Once we zero in on a set of pages that have slowed down, the next step is to identify at what intervals the slowdown happened. All tools provide a response time vs. elapsed time graph. We need to look at each slow page's response time, one by one, to get the time intervals.

See this picture.

From the above we can note that until the 50th minute, response times were OK; at that time around 85-90 users were running. From that point onward, we see performance degradation. So we have one clue: from the 50th minute, speed problems are seen. It is not enough to stop with this alone; usually more than one symptom is required to prove a point. Let us take the hits to the server. See this picture.


Up to the 50th minute, hits kept going to the server at an increasing rate. From that point, the hit count came down. So two things point in the same direction. Now it is time for us to analyze what happened around the 50th minute. The dev team or server maintenance team must start looking at all the server logs - web server log, app server log, database log - to see what happened at the 50th minute from the start of the run. Look for errors and exceptions. We must get something - we cannot say exactly what, but I am sure we will find some strange messages coming out of the application or web server at that point in time.

If you look at the first and second graphs together, you can see one more thing. Beyond 50 users, after the 30th minute or so, the hits did not increase in spite of more users getting into the system; the hits started coming down. So the issue might have started even before the 50th minute.
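
This kind of cross-check is easy to do from raw data as well. Below is a minimal sketch that buckets request timestamps into per-minute hit counts, so the throughput curve can be compared minute by minute with the running-user count. The input format (one epoch-millisecond timestamp per line) is an assumption, since each tool exports its own logs.

```java
// A minimal sketch: bucket raw request timestamps into per-minute hit counts.
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

public class HitsPerMinute {
    public static void main(String[] args) throws Exception {
        Map<Long, Integer> hitsPerMinute = new TreeMap<>();
        long testStartMillis = -1;
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            long ts = Long.parseLong(line.trim());
            if (testStartMillis < 0) testStartMillis = ts;
            long elapsedMinute = (ts - testStartMillis) / 60_000;
            hitsPerMinute.merge(elapsedMinute, 1, Integer::sum);
        }
        // A flattening or falling curve while users are still ramping up is the
        // "hits did not increase" symptom discussed above.
        hitsPerMinute.forEach((minute, hits) ->
                System.out.printf("minute %3d : %d hits%n", minute, hits));
    }
}
```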

Interesting, isn't it? We will continue in the next post.

Thursday, February 13, 2014

Performance Testing - Analysis - Part 1

Let us assume that we have executed a test with 1000 virtual users on our web application and collected all the required performance counter details as well. Now it is time to analyze. A performance tester must be in a position to get very close to the problem point, so that the development team can fix the performance issues easily. If we simply throw a pile of details at the dev team, it will not help anyone. Analysis of performance test results is a separate field by itself - a multi-billion dollar field.

When you see a performance test result report generated by the tool, first look for any errors. Usually tools list out HTTP 400-series and 500-series errors. The errors may happen randomly or persist over a longer period of time. Every such error must be explained and fixed - there is no second thought on this. If a page returns an HTTP 404 error, the dev team must either remove that reference or copy the actual file to the proper location. When HTTP 500-series errors happen, tools provide the response text that came from the server side; that should give a clue about what happened. If that is not sufficient, developers need to look at the web server log files, the app server log, the database logs and the application's custom log files for any exceptions.

If one sees no issues in the server logs, the cause is most probably wrong data passed to the pages. Hence the tester needs to check what data was passed to that particular vuser at that specific time - the tool will provide that. Never ignore even a single error; one error may cause a ripple effect and lead to multiple failures.

The next step is to identify the slow transactions. For this, we need to define what "slow" means. Is it 7 seconds or 9 seconds? Is it 9 seconds for all pages? Tough to say. But companies like Compuware publish industry benchmarks for sectors such as travel, banking, media etc. Check those benchmark response times and compare your page responses. Whichever pages are above the benchmark level definitely need fine tuning. So, first mark the pages whose response time is more than your performance goals, the industry benchmark, or both.

A page with a response time above your goal of 7 seconds (an assumption) may not be like that all the time. It might have responded in 3 seconds at some point and in 12 seconds at another. It is also better to compare the 90th percentile response time instead of a simple average. The 90th percentile is the response time within which 90 percent of your transactions completed, so it is not dominated by a few extreme samples - but tools differ slightly in how they compute it, so see the respective tool's help file. Our aim is to find the time at which the page responded slowly. Sometimes the page may start slowing down only 30 minutes after the start of the run. Usually tool reports have elapsed time on the X axis. From the elapsed time, find out in which time windows the page has slowed down.
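
For clarity, here is what the 90th percentile calculation looks like on raw response times, using the simple nearest-rank method; real tools may interpolate slightly differently, so treat this as a sketch with made-up sample values.

```java
// A minimal sketch of the 90th percentile calculation on raw response times (seconds).
import java.util.Arrays;

public class Percentile90 {
    static double percentile(double[] samples, double pct) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        // nearest-rank method: the value within which pct% of samples fall
        int rank = (int) Math.ceil(pct / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        double[] responseTimes = {3.1, 3.4, 2.9, 12.2, 4.0, 3.8, 9.7, 3.5, 3.6, 4.2};
        System.out.printf("average        : %.2f s%n",
                Arrays.stream(responseTimes).average().orElse(0));
        System.out.printf("90th percentile: %.2f s%n", percentile(responseTimes, 90));
    }
}
```

Here the simple average (around 5 seconds) hides the fact that the slowest responses took 10-12 seconds, which the 90th percentile exposes.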

Then check whether all pages slowed down during that time window or only this page. By now, you would have narrowed things down to the slow-responding pages and the slow response time windows.

What to do next? We will see that in our next post.


Monday, January 6, 2014

Performance Testing - Performance Counters

When the servers are loaded with hundreds of virtual user hits, the application response slows down. This slowness is caused by a variety of factors. If we want to improve the speed of the app, we need to know exactly what causes the slowness. We must understand that an internet app has web servers, app servers, database servers, load balancers, proxies, firewalls, the network etc. along the chain. A single bad component can pull down the response of the whole app. To identify the exact place of slowness, we must rely on performance counters.

Imagine a full health check-up. Our whole body undergoes so many tests: weight, heart rate, blood pressure, cholesterol levels, sugar levels, RBC count, WBC count, treadmill test results etc. are all taken and analyzed by the doctor. Each part of our body is an object, each object has many measurements based on tests, and these are translated into numbers. When everything is translated into numbers, it is easy to compare and isolate problem areas. Now treat each component in the entire web application chain the same way and measure it. In other words, we must monitor server health!

Each machine or device has CPU, memory, disk and a network connection to transfer data in and out. If we measure these objects on every machine, we will be in a much better position to analyze. Every load testing tool provides a performance counter collection component, and independent monitoring tools are also available in the market. Usually we start collecting these counters 10 minutes before the test run, keep collecting while the test is in progress, and continue until 10 minutes after the test is complete. The data is usually sampled every 5 or 10 seconds.
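
Here is a minimal sketch of that cadence, assuming we roll our own collector instead of using the load tool's monitor: sample a couple of counters on a fixed schedule and append them to a CSV for later correlation. The two counters sampled here are only placeholders for the full lists below.

```java
// A minimal sketch of periodic counter collection to a CSV file.
import java.io.FileWriter;
import java.io.PrintWriter;
import java.lang.management.ManagementFactory;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CounterSampler {
    public static void main(String[] args) throws Exception {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        PrintWriter out = new PrintWriter(new FileWriter("counters.csv", true), true);
        out.println("timestamp,loadAverage,freeHeapBytes");
        scheduler.scheduleAtFixedRate(() -> {
            // system load average (may be -1 on platforms that do not report it)
            double load = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
            long freeHeap = Runtime.getRuntime().freeMemory();
            out.printf("%d,%.2f,%d%n", System.currentTimeMillis(), load, freeHeap);
        }, 0, 10, TimeUnit.SECONDS);
        // In a real run this would start ~10 minutes before the test and stop
        // ~10 minutes after it; here we simply stop after an hour.
        scheduler.awaitTermination(1, TimeUnit.HOURS);
        scheduler.shutdown();
        out.close();
    }
}
```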

What to measure for CPU? There are hundreds of items that can be monitored; this list has only the vital counters: % CPU usage, % CPU used by the system, % CPU used by user applications, number of processes waiting in the queue to grab the CPU.

What to measure for memory? % memory in use, page faults, swap usage, cache hits.

What to measure for disk? Disk reads per second, disk writes per second, read/write errors, disk queue length.

What to measure for the network? Packets sent, packets received, packet errors, available bandwidth, TCP retransmissions, network queue length.
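
A few of the CPU and memory counters above can even be read from inside the JVM, assuming a HotSpot/OpenJDK runtime (the com.sun.management extension is not a standard API, and newer JDKs rename some of these methods). Disk and network counters still come from OS tools or the load tool's own monitors.

```java
// A minimal sketch of reading CPU and memory counters from inside a HotSpot JVM.
import com.sun.management.OperatingSystemMXBean;
import java.lang.management.ManagementFactory;

public class OsCounters {
    public static void main(String[] args) {
        OperatingSystemMXBean os =
                (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();

        double systemCpu = os.getSystemCpuLoad() * 100;     // % CPU usage (whole machine)
        double processCpu = os.getProcessCpuLoad() * 100;   // % CPU used by this JVM
        long totalMem = os.getTotalPhysicalMemorySize();
        long freeMem = os.getFreePhysicalMemorySize();
        double memInUsePct = 100.0 * (totalMem - freeMem) / totalMem;  // % memory in use

        System.out.printf("system cpu: %.1f%%, process cpu: %.1f%%, memory in use: %.1f%%%n",
                systemCpu, processCpu, memInUsePct);
    }
}
```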

The above counters must be collected from all servers that are part of the application environment. Over and above these, many platform-specific counters are available, and they should also be collected after consulting the respective system/server admins.

Apart from hardware-related counters, we must also collect software-related counters. For example, if you use Apache Tomcat, collect counters such as the number of active sessions, number of active connections, cache hit ratio, memory used by the server, pages cached etc. If you use an RDBMS, collect counters such as the number of active connections to the database, index usage percentage, number of waited locks, number of nowait locks, number of open tables, reads per second, writes per second, number of open cursors etc.
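
Tomcat exposes this kind of counter over JMX. Below is a minimal sketch, assuming remote JMX has been enabled on the Tomcat JVM; the host, port and MBean pattern are assumptions that vary by environment and Tomcat version, so list the available MBeans first and adapt.

```java
// A minimal sketch of pulling Tomcat thread-pool counters over JMX.
// "appserver1:9010" and the Catalina MBean pattern are assumptions to adapt.
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;
import java.util.Set;

public class TomcatCounters {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url =
                new JMXServiceURL("service:jmx:rmi:///jndi/rmi://appserver1:9010/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();

            // Discover what the server actually exposes under the Catalina domain.
            Set<ObjectName> names = mbs.queryNames(new ObjectName("Catalina:type=ThreadPool,*"), null);
            for (ObjectName name : names) {
                Object busy = mbs.getAttribute(name, "currentThreadsBusy");
                Object max = mbs.getAttribute(name, "maxThreads");
                System.out.printf("%s busy=%s max=%s%n", name, busy, max);
            }
        }
    }
}
```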

This means we need to collect hundreds of such measurements for every run and analyze them after the run. If these are not collected, we cannot isolate where the problem lies. How to identify the exact bottleneck - we will see that in the next post.

For a high-end load testing tool, visit http://www.floodgates.co.in

For free video lessons on load testing, visit http://www.openmentor.net.