One may get chest pain as a symptom; the immediate cause may be a blocked blood vessel, but the root cause may be a high cholesterol level. Performance analysis must aim for the root cause in the same way. Remember: performance analysis does not depend on just one parameter. We need to start questioning every single number that goes beyond our limits. In the last part, we had narrowed things down to a point: degradation started at around the 25th or 30th minute, when about 50 users were in the system. Slowness is almost always related to some system resource, directly or indirectly. Some part of the code may use more CPU, memory, network or disk, and that will clog the entire system.
Let us examine this graph.
You can clearly see that by the 25th minute, memory usage on one of the servers had gone beyond 95% of its total capacity. It remained that way till the end of the test. Now we know that a shortage of memory on that server caused the slowness from the 25th minute onwards. So, any program deployed on that server must be looked into. During load testing, not all programs are running. We know which beans, servlets, etc. are used by the scenario we tested, so we must start looking at those immediately.
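The 95% check above can be sketched as a small rule. This is a minimal illustration, not the actual monitoring setup used in the test; the memory figures below are invented for the example.

```python
# Sketch: flag a server whose memory usage crosses a threshold.
# Figures are in kB, in the style of /proc/meminfo (MemTotal / MemAvailable).

def mem_used_percent(total_kb, available_kb):
    """Percentage of memory currently in use."""
    return 100.0 * (total_kb - available_kb) / total_kb

def is_memory_exhausted(total_kb, available_kb, threshold=95.0):
    """True when usage has crossed the alert threshold (95% here)."""
    return mem_used_percent(total_kb, available_kb) >= threshold

# At the 25th minute, the suspect server might report something like:
print(is_memory_exhausted(total_kb=16_384_000, available_kb=700_000))  # True
```

In a real test you would sample these figures every minute (e.g. from `free` or `/proc/meminfo`) and record the first timestamp at which the flag turns true; that timestamp is what you correlate with the response-time graph.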
Look at this graph.
Processes runnable per CPU, in Linux, is the run-queue length. It tells us how many processes were waiting to grab a CPU on that machine. The longer the queue, the longer the wait time. This graph also indicates that from around the 25th minute onwards, the queue length went beyond the threshold.
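The same kind of rule works for the run queue. A minimal sketch, assuming a rule-of-thumb threshold of about 3 runnable processes per CPU (thresholds vary by workload; pick one that suits yours):

```python
# Sketch: per-CPU run-queue length check. The threshold of 3 runnable
# processes per CPU is an assumed rule of thumb, not a fixed standard.

def runnable_per_cpu(runnable, cpu_count):
    """Average number of runnable processes waiting per CPU."""
    return runnable / cpu_count

def queue_too_long(runnable, cpu_count, threshold=3.0):
    """True when the per-CPU queue exceeds the chosen threshold."""
    return runnable_per_cpu(runnable, cpu_count) > threshold

# e.g. 28 runnable processes on a 4-core box: 7 waiting per CPU.
print(queue_too_long(runnable=28, cpu_count=4))  # True
```

On a live box, the runnable count is the first column (`r`) of `vmstat` output, sampled over the duration of the test.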
Both graphs clearly indicate that consumption gradually increases. Memory is not released after use, so it just piles up. At some point the system gets exhausted, and then we see issues cropping up.
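That "piles up and never comes back down" pattern can itself be checked mechanically. A small sketch, with invented sample values, that flags a usage series which only climbs:

```python
# Sketch: detect leak-like growth in sampled usage. A healthy process
# releases memory, so its usage curve dips; a leaking one only climbs.
# The sample values below are invented for illustration.

def is_steadily_climbing(samples, min_rise=1.0):
    """True if usage never drops and rises meaningfully overall."""
    never_drops = all(b >= a for a, b in zip(samples, samples[1:]))
    overall_rise = samples[-1] - samples[0]
    return never_drops and overall_rise >= min_rise

usage = [40, 48, 57, 66, 75, 84, 95, 96, 96]  # % used, one sample per interval
print(is_steadily_climbing(usage))  # True: classic leak-like growth
```

A monotone climb on its own does not prove a leak (a cache warming up looks similar early on), but combined with the 95% plateau in the graph it is a strong signal.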
So now we know it is memory consumption that caused the issues. Who exactly consumed the memory? Good question. I will tell you that in the next post.