Tuesday, May 27, 2014

Performance Analysis - Part 4

When it comes to bottleneck isolation, first suspect the configuration settings on the web server, app server and db server. The top items you must check, before getting into the code level, are:
  1. Web server max connections
  2. JVM max heap memory setting
  3. Web server cache limit
  4. Database indexes for most frequently used queries
  5. Database max connections
  6. Swap file settings of the machine
The above is not a complete list, but these are some key items that most of us forget to check. Once these things are resolved, we need to get into the application code. But a performance tester may not know all the details of the code. No worries. There are tools that can get into the code and tell you where the problem is.

These tools are called APM (application performance monitoring) tools. Each comes with an agent that we must install on our servers. The agents send fine-grained details to the tool's central database, and the tool then shows the problem areas.

Look at the following image.

This is taken from a profiling tool. These tools run on the servers and collect deep-dive data, whereas load testing tools only give stats for the pages. If you look at this image, it has JSP details as well as details for LoginAction, LookupAction, TaxAction etc. An Action is Java code that runs on the server. While the load test is going on, the tool measures the time taken by these server-side functions and reports the details.

From this, we can see that main.jsp and LoginAction are more time consuming than the others. So we must dig deep into those and find the issues. Drilling down into a top level function like this is known as a transaction trace.

Look at this transaction trace.

This trace clearly shows the calls made within LoginAction, in the order they were made. From this we know how many Java calls and how many database calls are made. The number of times each call is made and the time taken by each are clearly provided.

Come on, what else do you need to identify the bottleneck? Go through transaction by transaction, go through each trace and find the function calls or database calls that take the maximum time. Tell your dev team to tune those functions for better performance.
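
Here is a minimal sketch of that ranking step, assuming the trace has been exported as rows of call name, call count and total time. The row layout, the call names and the numbers are made up for illustration; they are not the export format of any particular APM tool.

```python
from typing import NamedTuple


class TraceRow(NamedTuple):
    call: str        # function or SQL statement name (hypothetical)
    count: int       # how many times it was called in the transaction
    total_ms: float  # total time spent across those calls


def slowest_calls(rows, top=3):
    """Return the `top` rows with the highest total time, slowest first."""
    return sorted(rows, key=lambda r: r.total_ms, reverse=True)[:top]


# Illustrative data only; not taken from a real trace.
trace = [
    TraceRow("LoginAction.validate", 1, 40.0),
    TraceRow("SELECT user_profile", 12, 910.0),
    TraceRow("SessionHelper.create", 1, 25.0),
    TraceRow("SELECT permissions", 3, 130.0),
]

for row in slowest_calls(trace, top=2):
    per_call = row.total_ms / row.count
    print(f"{row.call}: {row.total_ms:.0f} ms over {row.count} calls "
          f"({per_call:.1f} ms per call)")
```

Sorting by total time rather than time per call is a deliberate choice here: a cheap query that is called many times can cost more than one slow function.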

How do you find this? If you say that this is not interesting, your emotion levels need to be checked!

Monday, March 31, 2014

Performance Testing - Analysis - Part 3

One may get chest pain as a symptom; the immediate cause may be a block in a blood vessel, but the root cause may be a high level of cholesterol. Performance analysis must aim to get to the root cause. Remember - performance analysis does not depend on just 1 parameter. We need to start questioning every single number that goes beyond our limits. In the last part, we had narrowed down to a point - degradation started at around the 25th or 30th minute, when 50 odd users were in the system. Slowness is definitely related to some of the system resources, directly or indirectly. Some part of the code may use more cpu or memory or network or disk, and that will clog the entire system.

Let us examine this graph.

You can clearly see that by the 25th minute, memory usage on one of the servers had gone beyond 95% of the total capacity. From that point till the end, it remained that way. Now we know that a shortage of memory on that server caused the slowness from the 25th minute onwards. So, any program deployed on that server must be looked into. During load testing, not all programs are running. We know which beans/servlets etc. are used for the scenario we tested, so we must start looking at those immediately.

Look at this graph.

Processes runnable per cpu, in Linux, is the run queue length. It means that many processes were waiting to grab the cpu on that machine. The longer the queue, the longer the wait time. We all know that fact. This graph also indicates that from around the 25th minute onwards, the queue length went beyond the threshold.

Both graphs clearly indicate that consumption gradually increases. The resource does not get released after usage and hence it just piles up. At some point, the system gets exhausted and then we see issues cropping up.
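
If the counter data is available as numbers rather than only as a graph, the point where a counter crosses its limit and stays there can be found with a few lines. This is only a sketch: the sample values are invented, the 95% limit is the one discussed for the memory graph, and the rule of 3 consecutive readings (`min_run`) is my own assumption to ignore one-off spikes.

```python
def first_sustained_breach(samples, threshold, min_run=3):
    """Return the elapsed minute at which the counter first stays at or
    above `threshold` for `min_run` consecutive readings, else None.

    samples: list of (elapsed_minute, value) pairs in time order.
    """
    run_start = None
    run_length = 0
    for minute, value in samples:
        if value >= threshold:
            if run_length == 0:
                run_start = minute  # a new run of high readings begins
            run_length += 1
            if run_length >= min_run:
                return run_start
        else:
            run_length = 0  # the run was broken; start counting again
    return None


# Illustrative memory-in-use percentages, one reading every 5 minutes.
memory_pct = [(5, 40), (10, 55), (15, 70), (20, 88),
              (25, 96), (30, 97), (35, 97), (40, 98)]

print(first_sustained_breach(memory_pct, threshold=95))
```

The same function can be run on the run queue length with its own threshold; if both counters breach around the same minute, the two symptoms point in the same direction.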

So, now we know, it is memory consumption that definitely caused issues. Who exactly consumed memory? Good question. Will tell that in the next post.

Tuesday, March 11, 2014

Performance Testing - Analysis - Part 2

Performance bottleneck analysis has a lot of similarities with crime investigation! We need to start with some messy area and then investigate things around that. Once we zero in on a set of pages that have slowed down, the next step is to identify at what intervals the slowdown happened. All tools provide the response time vs elapsed time graph. We need to look at the slow pages' response times, one by one, to get the time intervals.

See this picture.

From the above we can note that till the 50th minute, response was OK; at that time around 85-90 users were running. From that point onward, we see performance degradation. So, we have one clue that from the 50th minute, problems are seen in speed. It is not enough to stop with this alone. Usually more than one symptom is required to prove a point. Let us take the hits to the server. See this picture.

Up to the 50th minute, more hits were going to the server. From that point, the hit count came down. So, 2 things point in the same direction. Now it is time for us to analyze what happened around the 50th minute. The dev team or server maintenance team must start looking at all server logs - web server log, app server log, db log - to see what happened at the 50th minute from the start of the run. Look for errors and exceptions. We must get something - we cannot say what we will get, but I am sure we will get some strange messages coming out of the application or web server at that point of time.

If you look at the first and second graphs together, you can see one more thing. Beyond 50 users, after the 30th minute or so, the hits did not increase, in spite of more users getting into the system. The hits started coming down. So, the issue might have started even before the 50th minute.
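
Reading two graphs side by side is easy to get wrong, so the same check can be done on the numbers. A rough sketch, assuming both graphs can be exported as one series of (minute, running users, hits per second); the figures below are invented to mimic the shape described above and are not the actual test data.

```python
def throughput_stall(samples):
    """Return (from_minute, to_minute) for the first interval in which the
    user count went up but hits per second did not, else None.

    samples: list of (elapsed_minute, users, hits_per_sec) in time order.
    """
    for previous, current in zip(samples, samples[1:]):
        users_rose = current[1] > previous[1]
        hits_flat_or_down = current[2] <= previous[2]
        if users_rose and hits_flat_or_down:
            return (previous[0], current[0])
    return None


# Illustrative figures only.
run = [(10, 20, 40), (20, 35, 70), (30, 50, 95), (40, 70, 90), (50, 88, 80)]

print(throughput_stall(run))
```

If the interval reported here starts earlier than the minute at which response time visibly degrades, that is a hint to look at the server logs from the earlier time onwards.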

Interesting, isn't it? We will continue in the next post.

Thursday, February 13, 2014

Performance Testing - Analysis - Part 1

Let us assume that we have executed 1000 virtual users on our web application and collected all the required performance counter details as well. Now it is time to analyze. A performance tester must be in a position to go very close to the problem point, so that the development team can fix the performance issues easily. If we simply throw a pile of details at the dev team, it will not help anyone. Analysis of performance test results is a separate field by itself. A multi-billion dollar field.

When you see a performance test result report generated by the tool, first look for any errors. Usually tools will list out http 400 series errors or 500 series errors. The errors may happen randomly or may happen for a longer period of time. Every such error must be explained and fixed - there is no second thought on this. If a page returns an http 404 error, the dev team must either remove that reference or copy the actual file to the proper location. When http 500 series errors happen, tools provide the response text that came from the server side. That must give a clue on what happened. If that is not sufficient, developers need to look at the web server log files, database logs, app server log or the application's custom log files for any exceptions.
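
Most tools show this error list in their report. If you only have raw results, a small tally like the one below gives the same view. This is a sketch that assumes the results can be exported as (page, http status) pairs; the page names and status codes are invented.

```python
from collections import Counter


def errors_by_page(results):
    """Count 4xx and 5xx responses per (page, status) pair.

    results: iterable of (page, http_status) tuples.
    """
    return Counter(
        (page, status)
        for page, status in results
        if 400 <= status <= 599
    )


# Illustrative results only.
results = [
    ("/login", 200), ("/login", 500), ("/login", 500),
    ("/lookup", 200), ("/images/logo.png", 404), ("/tax", 200),
]

for (page, status), count in errors_by_page(results).most_common():
    print(f"{page}: http {status} x {count}")
```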

If one sees no issues in the server logs, the error is most probably because wrong data was passed to the pages. Hence the tester needs to check what data was passed by that particular vuser at that specific time - the tool will provide that. Never ignore even a single error. One error may cause a ripple effect and lead to multiple failures.

The next step is to identify the slow transactions. For this, we need to define what is slow. Is it 7 seconds or 9 seconds? Is it 9 seconds for all pages? Tough to say. But there are companies like Compuware that publish industry benchmarks for various industries such as travel, banking, media etc. Check those benchmark response times and compare your page responses. Whichever pages are above the benchmark level definitely need fine tuning. So, first mark the pages that have a response time above your performance goals or the industry benchmark or both.

A page with a response time above your goal of 7 seconds (assumption) may not always be like that. It might have responded in 3 seconds at some point and in 12 seconds at some other point. It is also better to compare using the 90th percentile response time instead of the simple average. Strictly, the 90th percentile is the response time within which 90 percent of your transactions completed. Some tools instead remove the highest 5% and lowest 5% of response times and average the rest - so see the respective tool's help file on how it arrives at its 90th percentile figure. Our aim is to find the time at which the page responded slowly. Sometimes, the page may slow down only after 30 minutes from the start of the run. Usually, tool reports will have elapsed time as the X axis. From elapsed time, find out in which time windows the page has slowed down.
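
To see why the choice of statistic matters, here is a small sketch that computes three figures on the same data: the simple average, a 90th percentile using the nearest-rank method, and an average taken after dropping the top and bottom 5%. Both methods are my own choices for illustration; your tool may use a different formula, and the response times are invented.

```python
import math
import statistics


def percentile_nearest_rank(values, pct):
    """Smallest value such that at least pct% of the samples are <= it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def trimmed_mean(values, trim_fraction=0.05):
    """Average after dropping the lowest and highest trim_fraction."""
    ordered = sorted(values)
    cut = int(len(ordered) * trim_fraction)
    kept = ordered[cut:len(ordered) - cut] if cut else ordered
    return statistics.mean(kept)


# Illustrative response times in seconds: mostly fast, a few slow ones.
response_times = [3.0] * 17 + [7.0, 12.0, 30.0]

print("simple average :", statistics.mean(response_times))
print("90th percentile:", percentile_nearest_rank(response_times, 90))
print("trimmed average:", round(trimmed_mean(response_times), 2))
```

On skewed data like this the three figures differ noticeably, which is why it is worth knowing which one your report shows before comparing it with a goal.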

Then check whether all pages slowed down during that time window or only this page. By now, you would have narrowed down to the slow responding pages and the slow response time windows.
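
The time windows can also be pulled out of the raw numbers. A minimal sketch, assuming per-page samples of (elapsed minute, response seconds) and the 7 second goal used as an assumption above; the sample data is invented.

```python
def slow_windows(samples, goal_seconds):
    """Return (start_minute, end_minute) spans in which the response time
    stayed above goal_seconds.

    samples: list of (elapsed_minute, response_seconds) in time order.
    """
    windows = []
    start = None
    last = None
    for minute, seconds in samples:
        if seconds > goal_seconds:
            if start is None:
                start = minute  # a slow window opens here
            last = minute
        elif start is not None:
            windows.append((start, last))  # the window just closed
            start = None
    if start is not None:
        windows.append((start, last))  # still slow at the end of the run
    return windows


# Illustrative samples for one page.
login_page = [(10, 3.1), (20, 4.0), (30, 8.2), (40, 9.5), (50, 6.0), (60, 12.0)]

print(slow_windows(login_page, goal_seconds=7))
```

Running this for every page and comparing the windows shows quickly whether the slowdown is common to all pages or limited to one.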

What to do next? We will see that in our next post.

Monday, January 6, 2014

Performance Testing - Performance Counters

When the servers are loaded with hundreds of virtual user hits, the application response will slow down. This slowness is caused by a variety of factors. If we need to improve the speed of the app, we need to know exactly what causes the slowness. We must understand that an internet app will have web servers, app servers, database servers, load balancers, proxies, firewalls, network etc. along the chain. A single bad component can pull down the response of the whole app. To identify the exact place of slowness, we must rely on performance counters.

Imagine a total health check. Our whole body will undergo so many tests. Weight, heart beat rate, blood pressure, cholesterol levels, sugar levels, RBC count, WBC count, treadmill test results etc. will all be taken and analyzed by the doctor. Each part of our body is an object, and each object has many measurements based on tests, which are translated to numbers. When all these details are translated to numbers, it is easy to compare and isolate problem areas. Now, treat each object in the entire web application chain as a health component and measure it. That means we must monitor server health!

Each machine or device will have cpu, memory, disk and a network connection to transfer data in and out. If we measure these objects on every machine, we will be in a better position to analyze. Every load testing tool also provides a performance counter collection component along with the tool. There are other independent tools available in the market as well. Usually we start collecting these performance object counters 10 minutes before the test run, collect data while the test is in progress, and keep collecting until 10 minutes after the test is complete. The data is usually collected every 5 or 10 seconds.
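
Load testing tools do this collection for you, but the idea is simple enough to sketch: read a counter, note the time, wait, repeat. In the sketch below `collect_samples` and `read_load_average` are names I made up; the only real system call used is Python's `os.getloadavg`, which exists on Unix-like systems only, so the reader returns None elsewhere. The demo uses a tiny interval so that it finishes instantly; a real run would use 5 or 10 seconds.

```python
import os
import time


def read_load_average():
    """1-minute load average on Unix-like systems, None where unavailable."""
    try:
        return os.getloadavg()[0]
    except (AttributeError, OSError):
        return None


def collect_samples(read_counter, interval_seconds, sample_count, sleep=time.sleep):
    """Call read_counter() sample_count times, interval_seconds apart.

    Returns a list of (timestamp, value) pairs. `sleep` can be replaced,
    for example in tests, so that no real waiting happens.
    """
    samples = []
    for index in range(sample_count):
        samples.append((time.time(), read_counter()))
        if index < sample_count - 1:
            sleep(interval_seconds)
    return samples


# Short demo: 3 readings, 0.01 seconds apart.
for stamp, value in collect_samples(read_load_average, 0.01, 3):
    print(f"{stamp:.2f}  load average: {value}")
```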

What to measure in CPU? There are 100s of items to be monitored; this list has only the vital counters. % cpu usage, % cpu used by system, % cpu used by user application, number of processes waiting in queue to grab cpu.

What to measure in memory? % memory in use, page faults, swap ratio, cache hits.

What to measure in disk? Number of disk read/sec, number of disk writes/sec, read/write errors, disk queue length.

What to measure in network? Number of packets sent, number of packets received, packet errors, available bandwidth, tcp retransmissions, network queue length.

The above counters must be collected from all servers that are part of the application environment. Over and above these, a lot of specific counters are available and they must also be collected, after consulting with respective system/server admins.

Apart from hardware related counters, we must also collect software related counters. For example, if you use Apache Tomcat, we must collect a few counters such as number of active sessions, number of active connections, cache hit ratio, memory used by the web server, pages cached etc. When you use an RDBMS, we must collect counters such as number of active connections to the db, index usage percentage, number of waited locks, number of nowait locks, number of open tables, reads per second, writes per second, number of open cursors etc.

This means we need to collect hundreds of such measurements for every run, and analyze them after the run. If these are not collected, we cannot isolate where the problem exists. How to identify the exact bottleneck - we will see in the next post.

For high end load testing tool, visit http://www.floodgates.co.in

For free video lessons on load testing, visit  http://www.openmentor.net.

Tuesday, December 10, 2013

Performance Testing - Load Generators

In today's context, 100 vusers is the minimum load expectation for any web application. If you want to test for just 100 or 200 users, you just need one machine to generate the load. Each vuser will in turn be a thread or process, running in the background, on the same machine where your load testing tool is installed. This machine is usually called the controller. Each thread/process will occupy 1MB to 20MB of memory, based on script size and data size, and each will consume some amount of cpu and disk. When I need to run 2000 vusers, one single machine is not enough to generate the load. Here we need to distribute the load generation process itself.

Imagine each vuser consuming 5MB of memory; if we run 1000 vusers, we will require 5GB for the vusers alone; over and above that, the OS and other software will consume memory. If we have a machine with 4GB memory, we cannot run 1000 vusers from that machine, because the load generating machine itself will crash. Also, when responses for all 1000 vusers are sent to the same machine, the network port of that machine will choke. Every tool has a facility to generate load from different machines. These are called load generators or load agents. From the controller machine, these load generator machines must be accessible via LAN. A small program, called the remote agent process, needs to be installed on all these machines.
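
The arithmetic above can be turned into a quick sizing helper. This is a back-of-the-envelope sketch that looks at memory only; the 1536MB reserved for the OS and other software is my assumption, and cpu, disk and network limits are ignored.

```python
import math


def load_generators_needed(vusers, mb_per_vuser, machine_ram_mb, reserved_mb):
    """Estimate how many load generator machines are needed, by memory alone.

    reserved_mb is the memory kept aside for the OS and other software
    on each machine (an assumed figure, measure it on your own machines).
    """
    usable_mb = machine_ram_mb - reserved_mb
    vusers_per_machine = usable_mb // mb_per_vuser
    if vusers_per_machine < 1:
        raise ValueError("machine too small to run even one vuser")
    return math.ceil(vusers / vusers_per_machine)


# 1000 vusers at 5MB each on 4GB machines, keeping 1.5GB for the OS.
print("memory for vusers alone:", 1000 * 5, "MB")
print("machines needed:", load_generators_needed(1000, 5, 4096, 1536))
```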

From controller, we must specify how many users are to be executed from each of the load generator machines. If our total user count is 1000, we can specify 300 from load-gen-1, 400 from load-gen-2 and 300 from load-gen-3. The target server for which we do the load test, must be accessible from all load gens; else scripts will fail. Once the load is distributed to all load gens, when the run starts, the tool will send the scripts to the load generators and instruct those to start the vuser thread/processes in those load gens. Hence, the memory and cpu of the controller will not be consumed. Every 5 or 10 seconds, the status will be sent back to the controller from all load gens.

This helps us in 2 ways. First, it helps us to run a large number of vusers using multiple regular desktops/laptops, without needing high end machines just for the sake of generating load. Second, we can run tests on the target server from a remote machine, other than the one where the tool license is installed. You may be in New York, USA, the target server may be in Ireland and the load generator may be in Los Angeles, USA. So, this helps us to do a load test with the load being generated in a different geography, not from the same place where the server is installed.

Monday, November 18, 2013

Performance Testing - Configure vuser count, duration

Executing performance tests is a relatively easy task compared to scripting, because the tool is going to do most of the work and the tester only needs to do a set of configurations. The 2 key configurations are user count and duration. A performance test will not usually have just 1 script running; rather a set of scripts will be executed in parallel as a combined scenario. This is to reflect different sets of users doing different operations on the same server. So it is very important for us to do proper configuration before hitting the start button.

Recollect our first few lessons in load test planning. We identify a set of most frequently used scenarios and identify their priorities. We may want to run 1000 virtual users, but how to distribute 1000 virtual users across different scripts? It is better to get stats from both business team and the webserver admin team. They can tell the historic usage of the transactions. In a banking scenario, we may see x% of users doing balance inquiry, y% doing deposits, z% doing withdrawals, etc. Usually the business team can provide how many deposits happened in last quarter/month in terms of number of transactions, number of withdrawals, number of balance inquiry, number of utility bill payments etc. From that number, we can arrive at the % of transactions for that activity. If total transactions are 100,000 and deposits are 12500, we can say deposit transaction has 12.5% consumption of total transactions with the server and so on.
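
The percentage calculation and the split of virtual users can be sketched as below. Only the 12,500 deposits out of 100,000 transactions comes from the example above; the other counts are invented so that the total adds up. Leftover users from rounding are handed out by largest remainder, which is one reasonable choice and not the only one.

```python
def allocate_vusers(total_vusers, transaction_counts):
    """Split total_vusers across scripts in proportion to historic
    transaction counts, so that the parts always add up to the total."""
    grand_total = sum(transaction_counts.values())
    exact = {
        name: total_vusers * count / grand_total
        for name, count in transaction_counts.items()
    }
    allocation = {name: int(share) for name, share in exact.items()}
    leftover = total_vusers - sum(allocation.values())
    # Give the remaining users to the scripts with the largest fractions.
    by_remainder = sorted(
        exact, key=lambda name: exact[name] - allocation[name], reverse=True
    )
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation


# Historic transaction counts (illustrative, except the deposits figure).
history = {
    "balance_inquiry": 55000,
    "deposit": 12500,
    "withdrawal": 22500,
    "bill_payment": 10000,
}

for script, users in allocate_vusers(1000, history).items():
    share = 100 * history[script] / sum(history.values())
    print(f"{script}: {share:.1f}% of transactions -> {users} vusers")
```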

We now have to fix the total duration of the run. We usually try to run scenarios for at least 1 hour with all users at peak load. It is better to run for a longer duration to get better statistics. It also verifies the reliability and consistency of the servers and apps. If a branch of a bank works from 9 to 3, it is better to run our tests for 3 hours (half of the working day). Again this is our way of planning; different consultants suggest anywhere between 25% and 75% of the total duration of the office hours.

But there is one important aspect in releasing virtual users to hit the server. If I need to run 1000 vusers, all 1000 will not start at the same time and hit the server. In real life, a crowd builds up slowly - both on roads and on the web. Hence we need to ramp up the user count slowly, rather than doing a big bang. If I need to run 1000 users for 3 hours (180 minutes) at peak load, how much time must I allow for user ramp up? We usually suggest the 80:20 principle. Take 20% of the total peak load duration, and allocate that for ramp-up. For 180 minutes that is 36 minutes, so I may allow roughly 30-35 minutes for users to ramp up and then run 180 minutes at peak. This means the test will run for 35+180 minutes. Some companies include the ramp-up time within the total duration and some do not. It does not really make a big difference.

If 1000 users need to ramp up in 35 minutes, how do we release new users to the load pool? You can either distribute evenly or release in batches. If I need to distribute evenly, I can release 1 user every 2 seconds, and that will give 1000 users by the 2000th second (about 33 minutes). This means the first user will start, after 2 seconds one more user will get added, after another 2 seconds another user will get added and so on. The other way is releasing in batches. Release 30 users every minute, which takes about 34 minutes for 1000 users. This is purely a subjective decision and it will vary from project to project. In an online examination scenario, all users will ramp up within 5 minutes, even though the exam duration is 2 or 3 hours.
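
The ramp-up numbers are easy to work out by hand, but a few helper functions make it quick to try other user counts and durations. A small sketch; the function names are mine and the 20% ratio is the rule of thumb from this post, not a setting of any tool.

```python
import math


def ramp_up_minutes(peak_minutes, ratio=0.20):
    """Ramp-up time as a fraction of the peak load duration."""
    return peak_minutes * ratio


def even_release_seconds(total_users, seconds_per_user):
    """Time taken when one user is released every seconds_per_user."""
    return total_users * seconds_per_user


def batch_release_minutes(total_users, users_per_batch, minutes_per_batch=1):
    """Time taken when users are released in fixed-size batches."""
    batches = math.ceil(total_users / users_per_batch)
    return batches * minutes_per_batch


print("20% of 180 minutes :", ramp_up_minutes(180), "minutes")
print("1 user every 2 sec :", even_release_seconds(1000, 2), "seconds")
print("30 users per minute:", batch_release_minutes(1000, 30), "minutes")
```

Both release patterns finish within the 35 minute ramp-up window for 1000 users, so the choice between them is about how smooth you want the build-up to be.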

For free lessons on automation tools, visit us at http://www.openmentor.net.