Wednesday, April 23, 2008

Capacity Planning - Sizing a Production Database Server

The process of designing a database server infrastructure to cater to the needs of the application or business before it goes live in a production environment is known as sizing.
It requires skilled requirements collection and analysis to project the needs of the application well before it is deployed. Following are a few generic techniques that are used to size a database server infrastructure.

§ Capacity Planning Interviews
One of the best techniques to understand an application well before it is deployed is through interaction with its stakeholders. The stakeholders can be anyone associated with the application: the business analysts who did the system requirements collection and analysis, the technical team who designed and developed the application, or the end users who will use the application. We need to interact with these various teams to collect as much information as possible regarding the application architecture and its usage. The following are the different aspects we need to understand through the interviews.

Ø Business Requirements
Business requirements can be collected by meeting and interacting with the business users and their leaders. We need to understand the nature of the application: who will be using it, how frequently the users will be using it, how critical it is, how many users will be using it concurrently, how complex its usage will be (for example, whether there will be simultaneous data input and data analysis processes), whether additional employees who will use the application are expected to join, etc.
The business users' requirements are the most important factor to understand while doing capacity planning. However well the application is built and maintained, it is ultimately the application usage pattern that determines its performance and availability.

Ø Technical Requirements
While the business requirements of the application can be obtained from the business people, the technical requirements should be collected from the technical people who represent the application. The technical requirements to be collected include database response time, application uptime, maintenance windows, expected performance benchmarks, data growth patterns, etc. These requirements have to be collected and documented so that we can size an infrastructure that meets them.

Ø Technical Environment
Collecting information regarding the technical environment of the application comes into the picture in business models where we will not be hosting the infrastructure ourselves. In such cases, the environment is as important as the business requirements, because we need to understand the technical environment in which the database application is going to run. This includes the network infrastructure, the security infrastructure, how efficient the client's system support teams are, whether they have a proper storage infrastructure, etc.
We also need to understand whether the company normally supports a staging or implementation server along with the production server. In such a case, any change can be applied and tested on the staging environment before being applied to the production server.

§ Collecting Performance Data and Baselines
This is perhaps the most important step in capacity planning. Having proper baselines in place is one of the main things people tend to forget or ignore when a new application is being deployed. Rather than collecting baselines, the main focus at that time tends to be on having the application satisfy all the functional requirements and go live in production.
It is important to have baselines in place because, in the absence of a proper set of baselines and workload analysis data, once the application goes live we will be ignorant of how it is expected to behave under various workloads and environments. Workload analysis can be defined as the process of collecting information on the expected behavior of the application under various workloads and environments. If a performance issue is reported on a production application, the first thing we can do is take the baseline for that workload and compare it with the current response. Thus we can easily detect the bottleneck and resolve it accordingly.
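
To make this concrete, here is a minimal sketch of such a baseline comparison; the counter names, values, and tolerance below are hypothetical:

# Hypothetical counters, baseline values, and tolerance for illustration.
baseline = {
    "processor_pct":        45.0,    # avg % Processor Time under this workload
    "disk_sec_per_read":     0.008,
    "memory_available_mb": 2048.0,
}
current = {
    "processor_pct":        88.0,
    "disk_sec_per_read":     0.009,
    "memory_available_mb": 1900.0,
}

TOLERANCE = 0.25   # flag anything more than 25% away from its baseline

for counter, base in baseline.items():
    deviation = abs(current[counter] - base) / base
    if deviation > TOLERANCE:
        print(f"{counter}: baseline={base}, current={current[counter]} "
              f"({deviation:.0%} off baseline) -- investigate")
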
Workload analysis requires three main components: the prototype server infrastructure, the load simulation tools, and the workload monitoring tools. Each of these is explained below.

Ø Prototype Server Infrastructure
A prototype infrastructure that exactly matches the production environment where the application is planned to be deployed is required. The details needed to simulate the production infrastructure have to be obtained from the client during the requirements analysis phase. Accordingly, a proper prototype has to be established on which the workload analysis can be done. There are numerous occasions where a matching prototype could not be established due to budget constraints, but it doesn't make sense to compare results obtained from an environment that doesn't match the production environment. Any analysis or tuning should be based on matching environments only.
A few factors that have to be considered while matching the environments are the following:
Hardware configuration
- Processor power
- Memory configuration
- Network infrastructure
- Storage configuration
Operating systems
- Version and patch level
- Server build
Application implementation
- Tier configurations
- Database instance configurations
For well-maintained production environments, there will be proper documentation of infrastructure configurations. Any new infrastructure change, or the introduction of a new device or asset, will be based on a predefined plan and documentation. The documentation includes details such as the production standard for the operating system (edition and version), the procedure to build and harden a production server, the standard hardware used for a typical production server, the storage configuration for a database server, etc. This documentation can be used to replicate the environment for the prototype infrastructure.

Ø Workload Simulation
Once the prototype infrastructure is in place, the next step is to simulate the workload on the application. For a typical multi-user database application, we can avail the help of various workload simulation tools, such as HP LoadRunner or Apache JMeter.
Simulation tools can be used to simulate various conditions on the application, on the hardware side as well as on the application usage side. Initially, we need to define a fixed number of environments to be simulated. Then we should define the hardware and software conditions that constitute each environment. The response of the application and the infrastructure should be measured against each of these simulated environments. These environments and the responses to each of them make up our performance baseline data. Baselines can be defined as performance measurements gathered over specific periods of time with specific sampling intervals. We need to ensure that we define a large number of possible environments and that performance baseline data is collected against each of these varied environments.

Ø Response Monitoring & Analysis
The application's response to the various predefined environments should be monitored and documented. The main purpose of having the monitor logs in place is to have official documentation for current and future reference. The logs provide an indisputable track record of the application's response across various environments. Any SLA or other agreement should use these track records as a reference when stating the response capability of the application and its ability to sustain various loads and adverse environments.
An example of load response results for an application is shown below.

Concurrent Users     Processor Utilization
10,000               40%
20,000               50%
30,000               60%
40,000               70%
50,000               80%
From the table, we can easily make out the relation between the number of concurrent users and the processor load. The document shows the "fact" that if the total number of users is less than 10,000, we should not expect a processor utilization of 80% unless something is wrong somewhere. This gives management the confidence to commit to the client on the response level of the application, and gives the technical team a baseline reference for finding processor-related application bottlenecks.
There are various tools that can be used for this kind of monitoring. One of the main ones is the built-in Windows Performance Monitor (Perfmon). Perfmon can be used to track the application's response through various software and hardware counters. The tracked response levels can be used to project the response and growth trend of the application with respect to time and load, and serve as the main input for the post-configuration capacity planning process. The upcoming sections explain monitoring and analysis using such counters specific to physical memory, the processor subsystem, the disk subsystem, etc.
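
As an illustration, the sketch below summarizes a counter log that has been exported to CSV; the file name and exact column layout are assumptions, though Perfmon logs are typically relogged or exported with a timestamp column followed by one column per counter:

import csv
import statistics

# A sketch only: the file name and column layout are assumptions about
# the exported log, not a guaranteed Perfmon format.
def summarize_counters(csv_path):
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)                  # timestamp, then counter paths
        samples = {name: [] for name in header[1:]}
        for row in reader:
            for name, value in zip(header[1:], row[1:]):
                try:
                    samples[name].append(float(value))
                except ValueError:
                    pass                       # skip blank or invalid samples
    for name, values in samples.items():
        if values:
            print(f"{name}: avg={statistics.mean(values):.1f}, "
                  f"max={max(values):.1f}, samples={len(values)}")

summarize_counters("baseline_log.csv")         # hypothetical log file
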
Ø Projecting the Trends
Projection is one of the main processes involved in post-configuration capacity planning. It is explained here because it is a continuation of, and makes use of, the monitor logs that we obtained from the load simulation environments.
Projection is the process of estimating the response level of the application with the help of already existing load response logs. In the above-mentioned example of users vs. processor utilization, we have information for a maximum of 50,000 concurrent users. Assume that the client plans to have 10,000 more analysts use the application. We can create a trend line from the existing information as follows.
[Trend line chart: processor utilization vs. concurrent users, extrapolated from the table above to 60,000 users]
The graph shows that the additional users will increase processor subsystem utilization to 90% of the total processor capacity. This information can help the stakeholders plan and make decisions that let them use the application effectively for their business.
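
For reference, the same projection can be reproduced with a simple least-squares fit over the load response table above (a sketch, assuming the relationship stays linear):

# Observed baseline data from the load response table above.
users   = [10_000, 20_000, 30_000, 40_000, 50_000]
cpu_pct = [40, 50, 60, 70, 80]

# Least-squares fit of cpu = slope * users + intercept.
n = len(users)
mean_u = sum(users) / n
mean_c = sum(cpu_pct) / n
slope = (sum((u - mean_u) * (c - mean_c) for u, c in zip(users, cpu_pct))
         / sum((u - mean_u) ** 2 for u in users))
intercept = mean_c - slope * mean_u

# Extrapolate to the planned 60,000 concurrent users.
projected = slope * 60_000 + intercept
print(f"Projected utilization at 60,000 users: {projected:.0f}%")   # -> 90%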

Linear Growth and Geometric Growth
In the above example, we assumed a direct, straight-line relationship between the number of concurrent users and the processor utilization. Such a relationship can be termed a linear projection. This may not always be the case. We need to consider that an increased amount of data may be loaded along with the additional users, which means the relationship between the users and the processor utilization may be compounding rather than linear. Consider the following example, where we have the data growth for the application across periods.

Period    Total Data
200701    100
200702    110
200703    121
200704    133
200705    146
200706    161
200707    177
200708    195
200709    214
200710    236
200711    259
We can see that the data size is increasing according to the pattern d + d × (10/100), where d is the previous period's data size, i.e. 10% growth per period. When we need to project the disk requirement for period 200712, we can calculate it as 259 + (259 × 0.1) ≈ 285. Here, the growth is geometric. The projection will be as follows –
[Chart: geometric projection of data size per period, extending the 10% growth trend through 200712]
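
A minimal sketch of this geometric projection in code, assuming the 10% period-over-period growth rate holds:

GROWTH_RATE = 0.10       # 10% growth per period, as observed above
data_size = 259          # data size at period 200711

# Project one period ahead (200712); extend the range for further periods.
for _ in range(1):
    data_size *= 1 + GROWTH_RATE

print(f"Projected data size for 200712: {data_size:.0f}")   # -> 285
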
§ Final Decision on the Production Server Infrastructure
The final decision on the database server infrastructure is based upon the performance data and baselines collected in the performance testing phase. When we share the performance test results with the client, the client will share the parameters that have to be satisfied by the application, such as the number of concurrent users, uptime requirements, etc. Based upon these requirements, we can finalize whether the hardware we have for the prototype satisfies them. In the above example, assuming the client's technical team agrees that processor utilization at any time can be up to 60%, it is clear that the processor subsystem of the prototype infrastructure is sufficient as long as the number of analysts accessing the application does not exceed 30,000. If either the client's business team wants more than 30,000 concurrent users, or the client's technical team wants the maximum processor utilization kept to 40% or so, we need to re-engineer our processor subsystem to increase its capacity. Otherwise, we can decide that the processor subsystem is sufficient for the time being.
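
This sizing check can be expressed as a small calculation; the sketch below assumes the linear relationship fitted earlier from the load response table (utilization rising by about 1% per 1,000 users from a 30% base):

# Figures derived from the sample load response table only.
def max_users_for_ceiling(ceiling_pct, pct_per_1000_users=1.0, base_pct=30.0):
    return round((ceiling_pct - base_pct) / pct_per_1000_users * 1000)

agreed_ceiling = 60      # % utilization the client's technical team accepts
required_users = 30_000  # concurrent users the business team expects

supported = max_users_for_ceiling(agreed_ceiling)
if supported >= required_users:
    print(f"Prototype sufficient: supports {supported:,} users at {agreed_ceiling}%")
else:
    print(f"Re-engineer: only {supported:,} users fit under {agreed_ceiling}%")
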
The final decisions will not always be this easy. There will be a lot of parameters to consider. For example, the client may require that at any time the application should hold the past two years' data. In such cases, the load simulation environments should account for the processor overhead that would be caused by the additional data. In short, the simulated environments should be smart enough to satisfy the forecasted environments for the specific period (two years, three years, etc.).

Ø Finalizing the SLAs
The SLA is a mutual agreement between the client and the solution provider stating that the application will satisfy the predefined requirements. The client's technical team will insist on terms such as 99.99% uptime, a round-trip response time of less than one second, etc. The final decision on the infrastructure should be taken such that the SLAs can be met. Though the SLAs come as part of the technical requirement specifications, it is possible that business and budget reasons will force adjustments to them once the prototype testing is done. Until the SLAs are met on the prototype infrastructure, the application should be fine-tuned.
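
As a quick reference for uptime negotiations, an availability percentage translates into a yearly downtime budget as follows (a simple sketch):

MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600

for sla_pct in (99.9, 99.99, 99.999):
    downtime = MINUTES_PER_YEAR * (1 - sla_pct / 100)
    print(f"{sla_pct}% uptime allows about {downtime:.1f} minutes of downtime per year")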
