Monday, March 31, 2014

Performance Testing - Analysis - Part 3

Chest pain may be the symptom, a blocked blood vessel the immediate cause, but the root cause may be a high cholesterol level. Performance analysis must aim for the root cause. Remember - performance analysis does not depend on just one parameter. We need to start questioning every single number that goes outside our limits. In the last part, we had narrowed down to a point: degradation started at around the 25th or 30th minute, when 50-odd users were in the system. Slowness is definitely related, directly or indirectly, to some system resource. Some part of the code may use more CPU, memory, network or disk, and that will clog the entire system.

Let us examine this graph.

You can clearly see that by the 25th minute, one server's memory usage has gone beyond 95% of total capacity. From that point till the end, it remained that way. Now we know that a shortage of memory on that server caused the slowness from the 25th minute onwards. So any program deployed on that server must be looked into. During a load test, not all programs are running. We know which beans/servlets etc. are used by the scenario we tested, so we must start looking at those immediately.
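The 95% check above is easy to automate. Here is a minimal sketch that computes the used-memory percentage from `/proc/meminfo`-style text on Linux; the sample values are illustrative, not taken from the actual test run.

```python
# Sketch: compute memory utilisation from /proc/meminfo-style text.
# The 95% threshold mirrors the alarm level seen in the graph.

def memory_used_percent(meminfo_text):
    """Return the percentage of memory in use."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        fields[key] = int(rest.split()[0])   # values are in kB
    total = fields["MemTotal"]
    # MemAvailable accounts for reclaimable cache, unlike MemFree
    available = fields.get("MemAvailable", fields["MemFree"])
    return 100.0 * (total - available) / total

# Illustrative snapshot, not real data from the run
sample = """MemTotal:       8000000 kB
MemFree:         200000 kB
MemAvailable:    350000 kB"""

usage = memory_used_percent(sample)
print(round(usage, 1))  # 95.6 - above the 95% alarm level
```

On a live server you would read the real file with `open("/proc/meminfo").read()` and sample it once a minute alongside the load test.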

Look at this graph.

Processes runnable per CPU in Linux is the run-queue length: that many processes were waiting to grab a CPU on that machine. The longer the queue, the longer the wait time - we all know that fact. This graph also indicates that from around the 25th minute onwards, the queue length has gone beyond the threshold.
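The metric itself is just the run queue (the `r` column of `vmstat`) divided by the number of logical CPUs; a value staying above 1 means processes are queuing for CPU time. A minimal sketch, with made-up sample numbers rather than readings from the real run:

```python
# Sketch: "processes runnable per CPU" = run-queue length / CPU count.
# A sustained value above 1 means processes are waiting for a CPU.

def runnable_per_cpu(run_queue_samples, n_cpus):
    return [r / n_cpus for r in run_queue_samples]

# Hypothetical vmstat 'r' samples, one per minute, on a 4-CPU box
samples = [2, 3, 5, 9, 14, 16]
per_cpu = runnable_per_cpu(samples, n_cpus=4)
saturated = [x > 1 for x in per_cpu]
print(per_cpu)     # [0.5, 0.75, 1.25, 2.25, 3.5, 4.0]
print(saturated)   # the queue outgrows the CPUs from the 3rd sample on
```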

Both graphs clearly indicate that consumption gradually increases. It does not get released after use, so it just piles up. At some point the system gets exhausted, and then we see issues cropping up.
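That "piles up" pattern is the classic shape of a leak. Here is a deliberately simplified sketch of how it happens in code - a cache that only ever grows. The names are illustrative, not from the tested application:

```python
# Sketch of the "piles up" pattern: a per-request cache with no
# eviction. Every request adds an entry and nothing ever removes
# one, so memory climbs with load until the system is exhausted.

cache = {}

def handle_request(request_id, payload):
    cache[request_id] = payload          # stored...
    return len(payload)                  # ...but never evicted

for i in range(1000):
    handle_request(i, "x" * 1024)        # roughly 1 KB per request

print(len(cache))  # 1000 entries and still growing with every request
```

Under a steady user load this kind of structure produces exactly the staircase-up-and-never-down curve seen in the memory graph.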

So now we know it is memory consumption that definitely caused the issues. Who exactly consumed the memory? Good question. I will tell you in the next post.

Tuesday, March 11, 2014

Performance Testing - Analysis - Part 2

Performance bottleneck analysis has a lot of similarities with a crime investigation! We need to start with some messy area and then investigate things around it. Once we zero in on a set of pages that have slowed down, the next step is to identify at what intervals the slowdown happened. All tools provide a response time vs. elapsed time graph. We need to look at the slow pages' response times, one by one, to get the time intervals.
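If your tool exports raw samples instead of a graph, the same view can be built by bucketing a slow page's response times by elapsed minute and averaging each bucket. A minimal sketch, with made-up (minute, seconds) samples and an illustrative 3-second threshold:

```python
# Sketch: average a page's response times per elapsed minute -
# the same view the response-time-vs-elapsed-time graph gives.

from collections import defaultdict

def average_by_minute(samples):
    """samples: iterable of (elapsed_minute, response_seconds)."""
    buckets = defaultdict(list)
    for minute, rt in samples:
        buckets[minute].append(rt)
    return {m: sum(v) / len(v) for m, v in sorted(buckets.items())}

# Illustrative samples, not from the real test
samples = [(48, 1.2), (49, 1.3), (50, 1.4), (51, 4.8), (52, 6.1), (52, 7.0)]
averages = average_by_minute(samples)
degraded_from = min(m for m, avg in averages.items() if avg > 3.0)
print(degraded_from)  # 51 - the slowdown shows up just after minute 50
```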

See this picture.

From the above we can note that till the 50th minute, response was OK; at that time around 85-90 users were running. From that point onward, we see performance degradation. So we have one clue: from the 50th minute, problems are seen in speed. It is not enough to stop with this alone. Usually more than one symptom is required to prove a point. Let us take the hits to the server. See this picture.

Up to the 50th minute, hits to the server kept increasing. From that point, the hit count came down. So two things point in the same direction. Now it is time for us to analyze what happened around the 50th minute. The dev team or server maintenance team must start looking at all the server logs - web server log, app server log, DB log - to see what happened at the 50th minute from the start of the run. Look for errors and exceptions. We must get something - we cannot say exactly what, but I am sure we will find some strange messages coming out of the application or web server at that point in time.
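The hits graph can also be reconstructed from raw data: count hits per elapsed minute and flag the first minute where the count falls while load is still ramping up. A small sketch with illustrative numbers, not taken from the real test:

```python
# Sketch: hits per elapsed minute, and the first minute where the
# throughput drops instead of rising with the user load.

def hits_per_minute(hit_minutes):
    counts = {}
    for m in hit_minutes:
        counts[m] = counts.get(m, 0) + 1
    return counts

def first_drop(counts):
    minutes = sorted(counts)
    for prev, cur in zip(minutes, minutes[1:]):
        if counts[cur] < counts[prev]:
            return cur
    return None

# Hypothetical: each entry is the elapsed minute of one hit
hits = [48] * 120 + [49] * 135 + [50] * 140 + [51] * 90 + [52] * 60
counts = hits_per_minute(hits)
print(first_drop(counts))  # 51 - the hit rate falls right after minute 50
```

A falling hit count under a rising user count is the second symptom: the server is no longer keeping up, so fewer requests complete per minute.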

If you look at the first and second graphs together, you can see one thing: beyond 50 users, after the 30th minute or so, the hits did not increase in spite of more users getting into the system. In fact, the hits started coming down. So the issue might have started even before the 50th minute.

Interesting, isn't it? We will continue in the next post.