How to forecast the performance of a database
Data requirement and way of plotting the gathered data
Phases of forecasting
Proven Forecast Models
Scalability and forecasting
1. Introduction: In the area of database performance the most frequently asked question is:
“If the system workload doubles, can the system handle the load?” or may be “If we upgrade our CPUs, add 20 more IO devices,and increase the workload by 20%, what will happen to response time, CPU utilization, and average IO utilization?”
To answer those questions we must follow some method. Here I am trying to formulate such a method where the key metrics for forecast are:
- queue length,
- and response time.
2. Data requirement and way of plotting the gathered data:
To get the values, you essentially need 3 things:
• a few simple formulas
• some basic operating system statistics
• some basic database statistics
definitions and symbols:
S : Time to service one workload unit. This is known as the service time or service
demand. It is how long it takes the CPU to service a single transaction. For
example, 1.5 seconds per transaction or 1.5 sec/trx. The best way to get the value
for Oracle systems is to simply derive it.
U : Utilization or CPU busyness. It’s commonly displayed as a percentage and that’s
how it works in this formula. For example, in the formula it should be something like
75% or 0.75, but not 75. A simple way to gather CPU utilization is simply running
sar –u 60 1. This will give you the average CPU utilization over a 60 second period.
λ : Workload arrival rate. This is how many transactions enter the system per unit
of time. For example, 150 transactions each second or 150 trx/sec. When working
with Oracle, there are many possible statistics that can be used for the “transaction”
arrival rate. Common statistics gathered from v$sysstat are logical reads, block
changes, physical writes, user calls, logons, executes, user commits, and user
rollbacks. You can also mix and match as your experience increases. For this paper,
we will simply use user calls.
Q : Queue length. This is the number of transactions waiting to be serviced. This
excludes the number of transactions currently being serviced. We will derive this
M : Number of CPUs. You can get this from the instance parameter cpu_count.
The CPU formulas for calculating averages are as follows:
U = ( S λ ) / M (1)
R = S / (1 - U^M) (2)
Q = ( MU / (1 - U^M) ) – M (3)
3. Phases of forecasting: Based on the above formulae we must do the following things:
- Workload Characterization
- Model Development
and then go to the forecast phase.
Here are the brief explanation of the phases:
Most important part of workload characterization is data gathering. Most likely you will have to
create your own data gathering tools. (could be used “sar”, “vmstat”, “glance” etc) And the data that is to be gathered will be highly dependant upon which forecast model you use
A detailed discussion of characterizing your workload is out of scope for this paper. But the thing to be
remembered is to keep it as simple as possible.
Every forecast uses some kind of model. Even a standard benchmark attempts to model a real-life
system. We use models everyday of our life. We draw pictures of an Oracle instance, we work on
an ER diagram, we play with model airplanes, cars, and trucks (some of us still do), and we
pretend we are building bridges and buildings with wood blocks. These are all models and they
are fundamental to understanding reality.
Once a forecast model has been developed we must statistically understand just how well the
model actually forecasts. The validation process takes an optimized model and runs unseen (i.e.,
data that the model was not been trained/optimized with) historical data through the model and
compares the forecast values against what we know to be true. It’s the classic predicted value
minus the actual value…hence we have an opportunity to statistically measure error and
randomness. The statistical information will be used when making our final forecast related
Very basic and simple statistics become very useful and extremely powerful. After we validate
our forecast model, we can responsibly make forecasting statements.
4. Proven Forecast Models
Simple Math. As the name states, it’s simple. The funny thing is, we use simple mathematical
forecasting all the time without ever thinking about it. For example, if we know an Oracle client
process consumes 10MB of non-shared resident memory and plans are to add another 50 users
(resulting in 50 additional client processes),3 then the system will require 500MB of additional
memory. Because there is no queuing involved with Simple Math, it’s only appropriate for
forecasts like memory, basic IO predictions, and basic networking. I would resist using simple
math for CPU, advanced IO, and advanced network predictions.
Ratio Modeling. Ratio modeling is a fast way to make low precision forecasts. Myself and two
colleagues developed the Ratio Modeling Technique back in the mid-1990’s in response to finding
ourselves in situations where we simply had to deliver a forecast, yet we did not have the time to
use an alternative model. Ratio Modeling works very well when you are quickly budgeting
hardware, assessing and exposing technical risk, validating alternative technical architecture
designs, and especially when sizing packaged applications. The technique enables you to define
the relationship between process categories (e.g., batch processes) and a specific system resource
(e.g., CPU). Ratio Modeling produces no statistical data, so it is truly a back of the envelop
forecast technique. It has been so useful, there is a separate paper devoted entirely to
understanding and using Ratio Modeling.
Linear Regression (LR). Linear regression is fantastic because it’s simple to use (assuming you
have a good tool4) and provides statistical measurements to add strength to your forecast
statements. I typically use LR when I want to quickly predict CPU utilization for a given business
activity like orders shipped per hour or web hits per second. Regression analysis requires data
and data requires a system to sample from. Therefore, LR requires a production environment.
The relationship between just about anything and CPU utilization is not linear. Forecast
techniques and reality demonstrate that at around 55% utilization, queue time is already
consuming around 5% of total response time. And at around 75% utilization, queue time is
already consuming around 15% of total response time. So while I enthusiastically use linear
regression, I will never make a forecast with a utilization over 75%. Simply because even though
the forecast model says it’s OK, I know it’s not because it expects the data to be linear when it’s
clearly not. So be careful when using LR, but use it liberally and appropriately.
Simple Queuing. Queuing theory is a wonderful and powerful forecast technique. It can even be
implemented into an MS-Excel workbook.5 Understanding queuing is a fundamental performance
management concept. The components are simple: a transaction, a queue (or line as we say in the
USA) and a server. When a transaction enters a queue response time starts and queue time starts.
When the transaction finally begins getting served by the server (a CPU perhaps), queue time
stops and service time begins. When the transaction is finished being served, service time stops
and response time stops. This can be symbolically6 shown mathematically:
Rt = St + Qt
Simulation. While common amongst the world’s research community, simulation is virtually
ignored when forecasting Oracle system performance. Cost and education are the primary
reasons, but with some instruction, a little experience, along with some basic statistics, simulation
can be a fantastic solution that yields surprisingly good precision.
Many commercial simulation packages are graphical and can produce some eye-popping ¡°ah ha¡±
moments when presenting the animated results to management. However, these commercial
packages can be relatively expensive and have a substantial learning curve. But there are
alternatives as I will discuss below.
Basically, a simulation forecast model simulates transactions working there way through a system
(in our case, an Oracle based system), while watching and recording what is happening. For
usability and precision7 reasons, the model is highly abstracted with the entire computing system
represented with only a few queues. To enable the simulation to mimic your production system, it
must have the ability to be tuned. The tuning process can take a long time if done by hand.
Suppose the simulation model had just four tunable parameters. The possible combinations are
virtually unlimited yet only a few combinations will yield an optimized/tuned forecast model that
will provide the precision we desire. Even when intelligently adjusting the parameters, to begin
approaching acceptable precision, it could take literally hundreds of individual simulation runs.
Obviously, an automated optimization solution is needed and I think you can see that the
computational power required could be tremendous. Fortunately, there are also solutions to this
challenge that makes low cost simulation forecast models possible. Keep in mind that for basic
demonstrations and when used as an educational tool, running a few simulations can work
With the right tools, simulation models can be built rather quickly, optimized rather quickly
(assuming the computing power is available), yield precise results, and allow for both component
level forecasts as well as the system as a whole. This makes simulation an extremely powerful
5. Scalability and forecasting
The Relationship Between Physical CPUs and Effective CPUs
Scalability is more than a function representing the relationship between workload and throughput.
When more broadly applied, scalability helps us understand the relationship between
physical CPUs and their effective application processing power (sometimes referred to as effective
CPU power, effective CPUs, or simply power).
How Scalability Is Used in Forecasting
Scalability gives our forecast models an additional dose of reality. Queuing theory-based models
assume that when given a CPU, the entire CPU’s power is available for user-related work.
This is not true, in part because a portion of that CPU’s resources must be used to manage that
CPU. Thus, we have the concept of effective CPU power in contrast to physical CPU power, as
noted in the previous section.
To infuse that additional dose of reality into our forecast models, the physical CPUs are
scaled down to effective CPUs before being placed into our forecast models. For example, if the
scalability model shows that our database server’s 10 physical CPUs actually supply 7 effective
CPUs of power, the key is to place 7 CPUs, not 10 CPUs, into the forecast model. If a forecast predicts
12 effective CPUs are required to meet service levels, those 12 effective CPUs are placed
back into the scalability model and transformed into perhaps 16 physical CPUs.
SCALABILITY VS. QUEUING
Scalability is not the same as queuing. Given enough arrivals, queuing will occur even with perfect scalability.
For example, suppose the operating system divides a process into 100 pieces of work, yet there are only
24 CPUs. As a result, there will be 76 pieces of work queued. In this situation, scalability (that is, the ability to parallelize) is not the issue. The issue is the combination of the arrival rate, the service time, and the number of servers is causing the pieces of work to queue. However, if there were 128 CPUs, then 28 CPUs would be idle because the operating system was unable to divide the processes into 128 pieces of work. In this situation,scalability is limiting throughput. So remember that while scalability and queuing are related, they are distinct.
Apart from forecasting, its very interesting to observe how hardware vendors, operating system vendors, database vendors, application vendors, and even businesses continue to find ways to increase parallelism, thereby reducing the throughput-limiting scalability effects. Because in our hearts, even if we are not technically focused, we know that increasing parallelism can increase throughput. And increasing throughput can ultimately mean increased
profits for the business IT was created to serve in the first place!