
Updated fourth printing of Chapter 9 for holders of prior printed and PDF book versions.

We have been and always will be about
Oracle Performance Management.

July 13, 2010

Dear Oracle Performance Firefighting enthusiast,

OraPub has decided to make the updated and final chapter of its Oracle Performance
Firefighting book, Oracle Performance Analysis, available to all print and PDF book
owners.

We realize this is unusual for a publisher to do, so we thought you’d be interested in
knowing why we are doing this:

1. The demand for the content has increased due to our new Advanced Analysis
course, the Evaluating Alternative Performance Solutions (or Quantifying Oracle
Performance) technical paper and the related conference presentations, and the general
contact I have with DBAs.

2. I have made quite a few minor changes to clarify the material including some
revised formulas for added precision.

3. Since I am spending quite a bit of time focused in this area, more and more material
is being developed. I want to get this material into the book as soon as I can, while at
the same time allowing everyone who has already purchased the book to have the latest
version of the chapter.

4. While the revisions are available in the PDF version of the book, the printed version
will not contain the updated chapter until possibly October.

Because of these items, I wanted to make sure you had the opportunity to download
the latest and greatest Chapter 9…without paying for another revision. Printing of the
PDF file is still disabled, but the PDF file is not password protected.

I hope you are enjoying the book and enjoy this updated Chapter 9!

To download the fourth printing of Chapter 9, go to:

http://filezone.orapub.com/FF_Book/chap9v4.pdf

Respectfully,

Craig Shallahamer

President & Founder, OraPub, Inc.
Portland, Oregon USA
503.636.0228

OraPub books are available at special quantity discounts to use as premiums and sales
promotions, or for use in corporate training programs. For more information, please contact
OraPub at http://www.orapub.com.

Oracle Performance Firefighting
Copyright © 2009, 2010 by Craig Shallahamer
All rights reserved. Absolutely no part of this work may be reproduced or transmitted in any
form or by any means, electronic or mechanical, including photocopying, scanning, recording,
or by any information or storage or retrieval system, without prior written permission of the
copyright owner and the publisher.

Please—Out of respect for those involved in the creation of this book and also for their
families, we ask you to respect the copyright both in intent and deed. Thank you.

ISBN-13: 978-0-9841023-0-3
ISBN-10: 0-9841023-0-2

Printed and bound in the United States of America.

Trademarked names may appear in this book. Rather than use a trademark symbol with every
occurrence of a trademarked name, we use the names only in an editorial fashion and to the
benefit of the trademark owner, with no intention of infringement of the trademark.

Fourth Printing: June 2010

Project Manager: Craig Shallahamer
Copy Editor: Marilyn Smith
Cover Design: Lindsay Waltz
Printer: Good Catch Publishing
Technical Reviewers: Kirti Deshpande, Dan Fink, Tim Gorman, Gopal Kandhasamy, Dwayne King, Dean Richards

Distributed to the book trade worldwide by OraPub, Inc. Phone +1.503.636.0228 or visit
http://www.orapub.com.

The information in this book is distributed on an “as is” basis, without warranty. Although
precautions have been taken in the preparation of this work, neither the author nor OraPub,
Inc. shall have any liability to any person or entity with respect to any loss or damage caused
or alleged to be caused directly or indirectly by the information contained in this book.

CHAPTER

9
Oracle Performance
Analysis

How many times have you been asked, “So what kind of performance improvement can we
expect?” It’s an honest and painfully practical question, which deserves an honest answer.
Unfortunately, while Oracle professionals are proficient in many areas, one area where there
is a glaring gap is in understanding the impact of their proposed performance solutions. The
skill needed to answer this question requires deep Oracle performance analysis proficiency.
This chapter’s coverage borders on predictive performance analysis and some serious
mathematics, yet I’ll keep focused on simplicity, practicality, and value (for example, I will
limit the number of Greek symbols and mathematical formulas to the absolute bare
minimum). Furthermore, I will always tie the analysis back to the fundamentals: an Oracle
response-time analysis, OraPub’s 3-circle analysis, and solid Oracle internals knowledge. I do
not intend to explain how to plan IT architectures. My goal is to provide substance,
conviction, and useful information, and to motivate change toward scientifically ranking the
many possible performance-enhancing solutions.
All anticipatory performance questions require a solid grasp of response-time analysis,
which is the first topic of this chapter. The good news is that if you have a solid understanding
of the topics covered in the first few chapters, you are adequately prepared. (If you just
opened this book, I highly recommend you review those first few chapters, as they set the
foundation for this chapter.) Next, I’ll present a fundamental and surprisingly flexible concept
commonly called utilization. Response-time analysis combined with a solid grasp of


utilization will prepare you for the next topic, which is understanding the various ways our
solutions influence performance. Finally, we’ll dive into anticipating our solution’s impact in
terms of time and utilization. To ensure you can do everything presented in this chapter, I will
provide a number of examples. That’s a tall order for a single chapter, so let’s get started!

Deeper into Response-Time Analysis


Oracle performance analysis that is fundamentally based on response-time analysis has the
inherent advantage of expanding naturally into anticipating change. To do this, we need
to take the components of response time another level deeper, reduce some of the abstraction I
have been using in this book, and examine the relationship between response-time
components specifically used in an Oracle environment.

Oracle Arrival Rates


Whether rivers flow into the ocean, people enter an elevator, or transactions enter an Oracle
system, over an interval of time, they arrive at an average rate. It could be that just before
Friday’s time card entry deadline, between 4:30 p.m. and 5:00 p.m., 9,000 transactions
occurred.
The average arrival rate is expressed in units of work and units of time. In the prior time
card example, the arrival rate is likely to be expressed in terms of transactions and minutes.
The math involved is very straightforward: divide the total number of transactions that arrived
by the time interval. For the example, it would be 9,000 transactions divided by 30 minutes,
which is 300 trx/min.
There can be a difference between the rate of transaction arrivals (or entries) and the rate
of transaction exits. The transactions actually being processed are known as the workload. A
system is deemed stable when, on average, the transaction entries equal the transaction exits.
If this does not occur, eventually either so many transactions will build up on the system that
it will physically shut down, or there will be so few transactions that no work will be
performed. Because of this equality, for our work with Oracle systems, it is acceptable to refer
to the arrival rate as the workload and vice versa. Use the term that makes your work easily
understandable for your audience.
The symbol for the arrival rate is universal across publications: the Greek letter lambda (λ).
For the example of an arrival rate (the work performed over a period of time) of 9,000
transactions over a 30-minute period, using symbols and converting to seconds, the arrival
rate calculation is as follows:

λ = 9,000 trx / 30 m = 300 trx/m × 1 m / 60 s = 5 trx/s
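For readers who prefer to check the arithmetic programmatically, the conversion above can be sketched in a few lines of Python. This is an illustration only, using the example’s numbers:

```python
# Arrival rate for the time-card example: 9,000 transactions in 30 minutes.
transactions = 9_000
interval_minutes = 30

rate_per_minute = transactions / interval_minutes   # 300 trx/m
rate_per_second = rate_per_minute / 60              # 5 trx/s

print(rate_per_minute, rate_per_second)  # 300.0 5.0
```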
Figure 9-1 is an actual Statspack report from an Oracle Database 10g Release 2 system
that was experiencing severe cache buffer chain (CBC) latch contention. The Load Profile
section appears near the top of both the Statspack and AWR reports. Over the Statspack
reporting duration, on average, Oracle processed 0.22 transaction each second, 145,325
logical IOs per second, and 415 user calls per second. These reports have captured an initial
value and ending value from a specific statistic, such as commits, database calls, or perhaps
redo generated.


Load Profile Per Second Per Transaction


~~~~~~~~~~~~ --------------- ---------------
Redo size: 22,936.28 103,552.74
Logical reads: 145,324.74 656,112.40
Block changes: 127.49 575.58
Physical reads: 3.68 16.61
Physical writes: 3.39 15.29
User calls: 414.85 1,872.97
Parses: 6.94 31.31
Hard parses: 0.11 0.48
Sorts: 68.61 309.75
Logons: 0.08 0.36
Executes: 192.89 870.86
Transactions: 0.22

Figure 9-1. Shown is a Statspack Load Profile section from an active Oracle Database 10g
Release 2 system experiencing serious CBC latch contention. Each load profile metric can be
used to represent the arrival rate (the workload).

The load profile calculations are as follows:

λ = (S1 − S0) / T

where:
• λ is the arrival rate
• S1 is the ending snapshot (captured or collected) value
• S0 is the initial snapshot (captured or collected) value
• T is the snapshot interval
The following is an example of how we could gather the arrival rate, expressed in user
calls per second, over a 5-minute (300-second) period.


SQL> select name, value
  2  from v$sysstat
  3  where name = 'user calls';

NAME VALUE
---------------------------------------------------- ----------
user calls 37660

1 row selected.

SQL> exec sys.dbms_lock.sleep(300);

PL/SQL procedure successfully completed.

SQL> select name, value
  2  from v$sysstat
  3  where name = 'user calls';

NAME VALUE
---------------------------------------------------- ----------
user calls 406376

1 row selected.

Placing the collected Oracle workload data into the arrival rate formula, expressed in
user calls per second, the arrival rate is 1,229.05 uc/s.

λ = (S1 − S0) / T = (406,376 uc − 37,660 uc) / 300 s = 368,716 uc / 300 s = 1,229.05 uc/s
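The same delta-over-interval calculation can be verified with a quick sketch; the snapshot values are taken from the listing above:

```python
# Arrival rate from two v$sysstat 'user calls' snapshots.
s0 = 37_660    # initial snapshot value (uc)
s1 = 406_376   # ending snapshot value (uc)
t = 300        # snapshot interval (s)

arrival_rate = (s1 - s0) / t   # uc/s
print(round(arrival_rate, 2))  # 1229.05
```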
The Statspack and AWR reports calculate their load profile metrics the same way, but
they store the collected data in different tables and use different sampling techniques. For
example, the Statspack facility typically collects data in 60-minute intervals and stores the
data in tables starting with stats$ (the key table is stats$snap). The AWR report draws
from the WRH$ tables, which contain the workload repository’s historical snapshot data. With just
a little creativity, you can devise your own reports that pull from the Statspack or AWR
tables.
Using the arrival rate formula and the Statspack data shown in Figure 9-2, we can easily
perform the same load profile calculations as the Oracle Statspack developers (shown in
Figure 9-1). The raw data used for the calculations is included in the Snapshot and Instance
Activity Stats sections. The snapshot interval, labeled simply as “Elapsed,” is expressed in
minutes, which is the difference between the beginning snap time and ending snap time. The
workload interval activity has been calculated and is displayed in the Instance Activity Stats
section. While not shown, the interval activity is simply the difference between the ending
statistic value and the beginning statistic value.


Snapshot Snap Id Snap Time Sessions Curs/Sess Comment


~~~~~~~~ ---------- ------------------ -------- --------- ------------
Begin Snap: 2625 02-Oct-08 08:15:02 139 8.2
End Snap: 2635 02-Oct-08 10:45:00 164 8.5
Elapsed: 149.97 (mins)

...

Instance Activity Stats DB/Inst: PDXPROD/PDXPROD Snaps: 2625-2635

Statistic Total per Second per Trans


------------------------------- -------------- -------------- ------------
CPU used by this session 1,675,794 186.2 840.8
consistent gets 1,306,587,656 145,208.7 655,588.4
db block gets 1,044,354 116.1 524.0
user calls 3,732,826 414.9 1,873.0
user commits 1,993 0.2 1.0
user rollbacks 0 0.0 0.0
workarea executions - onepass 2 0.0 0.0
...

Figure 9-2. Two other Statspack sections from the report in Figure 9-1. The timing detail and
a few of the statistics in the Instance Statistics sections are shown. This is enough information
to calculate the workload; that is, the arrival rate expressed in commits, transactions, logical
IOs, and user calls per second.

Using the data shown in Figure 9-2, the user calls per second workload metric is
calculated as follows:

λ uc/s = 3,732,826 uc / 149.97 m × 1 m / 60 s = 3,732,826 uc / 8,998.20 s = 414.84 uc/s
Referencing the Load Profile’s user calls per second metric shown in Figure 9-1, notice
it closely matches our calculation of 414.84 uc/s. The difference is due to the two-digit
precision of 149.97 minutes. Knowing that an Oracle transaction contains both commits
(statistic user commits) and rollbacks (statistic user rollbacks), you can calculate
the transaction rate and compare that to the load profile. You’ll find the Statspack report does
the math correctly.
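As a quick check of that claim, the transaction rate can be computed from the user commits and user rollbacks totals in Figure 9-2. This is a sketch of the arithmetic, not part of the Statspack tooling:

```python
# Transaction rate: (user commits + user rollbacks) over the elapsed interval.
user_commits = 1_993
user_rollbacks = 0
elapsed_s = 149.97 * 60   # 149.97 minutes in seconds

trx_per_sec = (user_commits + user_rollbacks) / elapsed_s
print(round(trx_per_sec, 2))  # 0.22, matching the Load Profile in Figure 9-1
```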
The arrival rate is one of the most fundamental aspects of expressing what is occurring
in a system—whether it’s Oracle, a river, or an expressway. As you’ll see in the following
sections, when combined with other metrics, the arrival rate can be used to convey part of the
performance situation and also provides clues about how our proposed performance solutions
will affect the system.

Utilization
If I were standing in front of you right now, I would have in my hands an empty glass and a
pitcher of water. I would hold out the empty glass and say over and over, “capacity.” Then I
would hold out the pitcher and say repeatedly, “requirements.” Then I would ask you, “Is the
water going to fit in the glass? Are the requirements going to exceed the capacity?” In IT,


what usually occurs is the water is poured in the glass, and we all look away, hoping it will fit.
After a while, we start feeling the water dripping down our arm, and we have a mess. That
mess is the result of the requirements exceeding the available capacity. When this occurs, we
have a performance firefighting situation.
Utilization is simply the requirements divided by the capacity:

U = R / C
where:
• U is utilization
• R is requirements with the same units as capacity
• C is capacity with the same units as requirements
The performance management trick is to ensure the requirements will fit into the
available capacity. In fact, if we can mathematically express the requirements and capacity—
injecting alterations such as politics, budget, purchases, timing, and new and changing
workloads—we have a much better chance of anticipating change. But if we guess at the
requirements or the capacity, then everyone is just plain lucky if the solution works.
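The utilization formula is trivial to express in code. Here is a minimal sketch with purely hypothetical requirement and capacity values:

```python
def utilization(requirements, capacity):
    """Utilization is requirements divided by capacity (same units for both)."""
    return requirements / capacity

# Hypothetical numbers: 1.5 CPU seconds consumed each second on a server
# that can supply 2 CPU seconds each second.
print(utilization(1.5, 2.0))  # 0.75, i.e., 75%
```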

Requirements Defined
Requirements are one of the two metrics we need to derive utilization. Requirements can take
on many useful forms, like CPU seconds consumed per second or in a peak workload
hour, IO operations performed per second or in a peak hour, or megabytes
transferred per second or per hour. We can also change the tense from the past, “CPU seconds
used yesterday between 9 a.m. and 10 a.m.” to “How much CPU is the application now
consuming each second?” or to the future, “How much CPU time do we expect the
application to consume during next year’s seasonal peak?” Don’t tie yourself to a single rigid
requirement definition. Throughout your work, allowing a flexible requirements definition
will help bring clarity to an otherwise muddy situation.
Requirements can also be articulated in terms of more traditional Oracle workload
metrics like user calls, SQL statement executions, transactions, redo bytes, and logical IO. For
example, referring to the workload profile shown in Figure 9-1, which is based on
v$sysstat, the workload can be expressed as 415 uc/s, 0.22 trx/s, 145,325 LIO/s, or
22,936 redo bytes generated per second. Referring to Figure 9-2, the system requirements can
also be expressed as 1,675,794 centiseconds (16,757.94 seconds) of CPU consumed over the
149.97-minute interval. This means on average every second, the Oracle instance consumed
1.862 seconds of CPU, which is a simpler way of saying 1.862 CPU seconds consumed per
second. At first, it may seem strange to speak of CPU consumed like this, but it is very correct
and sets us up for the next topic, which is capacity.
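This CPU-seconds-per-second figure can be reproduced from the Figure 9-2 numbers with a short sketch (illustrative only):

```python
# CPU seconds consumed per second, from the Figure 9-2 statistics.
cpu_consumed_cs = 1_675_794   # 'CPU used by this session', centiseconds
elapsed_s = 149.97 * 60       # 8,998.2 s

cpu_consumed_s = cpu_consumed_cs / 100        # 16,757.94 s of CPU
print(round(cpu_consumed_s / elapsed_s, 3))   # 1.862 CPU seconds per second
```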
Once the definition of requirements is set, the data must be collected. Most Oracle
systems now collect Statspack or AWR data, which means the data collection is currently
occurring for you. Your job is to extract the necessary information.


Gathering CPU Requirements


When gathering CPU requirements, we typically look at the time model system statistics
(v$sys_time_model) or the instance statistics (v$sysstat). In previous chapters, I
have presented how to gather CPU requirements from the v$sesstat, v$sysstat,
v$ses_time_model, and v$sys_time_model views.
The second part of Figure 9-2 shows a few instance statistics from a Statspack report.
Based on this data, during the Statspack reporting interval, Oracle processes consumed—that
is, required—1,675,794 centiseconds of CPU, which is 16,757.94 seconds of CPU.
The time model system statistics provide more accurate CPU consumption details.
Figure 9-3 shows the Time Model System Stats section of the Statspack report shown in
Figures 9-1 and 9-2. According to the time model statistics, Oracle processes (server and
background) consumed 16,881.6 seconds of CPU during the reporting interval of 149.97
minutes. Notice that in this case, there is little difference between the instance (shown in
Figure 9-2) and time model (shown in Figure 9-3) CPU consumption statistics.

Time Model System Stats DB/Inst: PDXPROD/PDXPROD Snaps: 2625-2635


-> Ordered by % of DB time desc, Statistic name

Statistic Time (s) % of DB time


----------------------------------- -------------------- ------------
sql execute elapsed time 578,732.7 99.2
DB CPU 16,749.8 2.9
parse time elapsed 300.2 .1
PL/SQL execution elapsed time 252.0 .0
hard parse elapsed time 189.5 .0
connection management call elapsed 115.6 .0
RMAN cpu time (backup/restore) 96.7 .0
repeated bind elapsed time 55.9 .0
PL/SQL compilation elapsed time 5.7 .0
hard parse (sharing criteria) elaps 2.9 .0
sequence load elapsed time 1.9 .0
failed parse elapsed time 0.0 .0
hard parse (bind mismatch) elapsed 0.0 .0
DB time 583,482.3
background elapsed time 12,739.5
background cpu time 131.8

Figure 9-3. Shown is the Time Model System Stats section from the same Statspack report as
shown in Figures 9-1 and 9-2. During the Statspack reporting interval, Oracle server
processes consumed 16,749.8 seconds of CPU and Oracle background processes consumed
131.8 seconds of CPU.

Along with using Statspack or AWR to gather and report CPU consumption, you can
also easily collect this information yourself. Simply gather the initial value, final value, and if
you want, the consumption per second, over the desired time interval. Figure 9-4 shows a
code snippet used to collect CPU consumption based on v$sys_time_model. During the
60-second interval, the Oracle instance processes consumed 82.4 seconds of CPU; that is, on
average 1.37 seconds each second.


SQL> def interval=60
SQL> col t0_s new_value t0_s
SQL> select sum(value)/1000000 t0_s
  2  from v$sys_time_model
  3  where stat_name in ('DB CPU','background cpu time');

T0_S
----------
498.995974

1 row selected.

SQL> exec sys.dbms_lock.sleep(&interval);

PL/SQL procedure successfully completed.

SQL> select sum(value)/1000000 t1_s,
  2         sum(value)/1000000-&t0_s CPU_s_Consumed,
  3         (sum(value)/1000000-&t0_s)/&interval CPU_s_Consumed_per_sec
  4  from v$sys_time_model
  5  where stat_name in ('DB CPU','background cpu time');
old 2: sum(value)/1000000-&t0_s CPU_s_Consumed,
new 2: sum(value)/1000000-498.995974 CPU_s_Consumed,
old 3: (sum(value)/1000000-&t0_s)/&interval CPU_s_Consumed_per_sec
new 3: (sum(value)/1000000-498.995974)/60 CPU_s_Consumed_per_sec

      T1_S CPU_S_CONSUMED CPU_S_CONSUMED_PER_SEC
---------- -------------- ----------------------
581.431481 82.435507 1.37392512

1 row selected.

Figure 9-4. Shown is a code snippet used to collect and then determine instance CPU
consumption, based on v$sys_time_model, over a 60-second interval. The CPU
consumed (82.4s) and also the CPU consumed per second (1.37s) are displayed.

Gathering IO Requirements
Gathering IO requirements is more complicated than gathering CPU requirements. Oracle9i
Release 2 and earlier require querying from both v$sysstat and v$filestat, whereas
later Oracle releases require querying only from v$sysstat. And depending on the
information desired, different statistics are required. The following snippet shows the
formulas for raw Oracle IO consumption (requirements) for Oracle9i Release 2 and earlier
versions:

Server process read IO operations
    = sum(v$filestat.phyrds)
Server process read MBs
    = sum(v$filestat.phyblkrd X block size) / (1024 X 1024)

Database writer and server process write IO ops
    = sum(v$filestat.phywrts)
Database writer and server process write MBs
    = sum(v$filestat.phyblkwrt X block size) / (1024 X 1024)


Log writer write IO operations
    = v$sysstat.redo writes
Log writer write MBs
    = (v$sysstat.redo size) / (1024 X 1024)

The following formulas are appropriate for versions later than Oracle9i Release 2:

Server process read IO operations
    = v$sysstat.physical read IO requests
Server process read MBs
    = (v$sysstat.physical reads X block size) / (1024 X 1024)

Database writer and server process write IO ops
    = v$sysstat.physical write IO requests
Database writer and server process write MBs
    = (v$sysstat.physical writes X block size) / (1024 X 1024)

Log writer write IO operations
    = v$sysstat.redo writes
Log writer write MBs
    = (v$sysstat.redo size) / (1024 X 1024)

Figure 9-5 shows the instance statistics we need to calculate Oracle’s IO consumption
(its requirements) over Statspack’s 149.97-minute interval. The only missing piece of
information is the Oracle block size, which is needed to determine the MB/s figures. While
not shown, the value is found in the Instance Parameter portion of any Statspack or AWR
report. For this example, the db_block_size is 8192, which is 8KB.

Instance Activity Stats DB/Inst: PDXPROD/PDXPROD Snaps: 2625-2635

Statistic Total per Second per Trans


--------------------------------- ------------ -------------- ------------
physical read IO requests 24,027 2.7 12.1
physical reads 33,106 3.7 16.6
physical write total IO requests 22,959 2.6 11.5
physical writes 30,466 3.4 15.3
redo size 206,380,604 22,936.3 103,552.7
redo writes 3,314 0.4 1.7

Figure 9-5. Based on the same data as Figures 9-1, 9-2, and 9-3, shown are Oracle IO-
related consumption (requirements) metrics. This is an Oracle Database 10g Release 2
system, so all the metrics can be gathered from the Instance Statistics section (based on
v$sysstat), and then used to calculate read and write IO requirements for the server
processes and background processes in both megabytes per second (MB/s) and IO operations
per second (IOPS).

Using the formulas for releases later than Oracle9i Release 2 and based on the Statspack information
shown in Figure 9-5, to determine the total IO read and write operations, we must sum the
server process, database writer background process, and log writer background process IO
read and write operations. The results are as follows:


Server process read IO operations = 24,027
DBWR and server process write IO operations = 22,959
LGWR write IO operations = 3,314

The total of read and write IO operations over the 149.97-minute interval was 50,300.
To get the standard IO operations per second (IOPS), simply divide the total IO operations by
the reporting interval, remembering to convert to seconds.

IOPS = operations / time = 50,300 ops / 149.97 m × 1 m / 60 s = 5.59 ops/s = 5.59 IOPS
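The same IOPS arithmetic, sketched programmatically with the values from Figure 9-5 (an illustration only):

```python
# Total IOPS from the Figure 9-5 statistics.
read_ops = 24_027    # physical read IO requests
write_ops = 22_959   # physical write total IO requests
redo_ops = 3_314     # redo writes (LGWR)
elapsed_s = 149.97 * 60

total_ops = read_ops + write_ops + redo_ops   # 50,300
print(round(total_ops / elapsed_s, 2))        # 5.59 IOPS
```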
When the IO administrator asks for Oracle’s IO requirements, based on the Statspack
report time period, you can confidently say Oracle’s IO requirements were 5.59 IOPS. And if
the IO administrator wants the breakdown by read and write operations, or even in megabytes
per second, that can also be provided. Be sure the IO administrator understands this is truly
Oracle’s IO requirements and it is likely the IO subsystem will require multiple physical IO
actions to complete a single Oracle IO operation.
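If the megabytes-per-second breakdown is wanted, the conversion follows the read-MB formula given earlier. This sketch uses the physical reads total from Figure 9-5 and the 8KB block size, purely as an illustration:

```python
# Server process read MB/s from Figure 9-5, using an 8KB block size.
physical_reads = 33_106   # blocks read
block_size = 8_192        # db_block_size in bytes
elapsed_s = 149.97 * 60

read_mb = physical_reads * block_size / (1024 * 1024)
print(round(read_mb, 1))              # 258.6 MB over the interval
print(round(read_mb / elapsed_s, 3))  # 0.029 MB/s
```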

Capacity Defined
Capacity is the other aspect of utilization. As I mentioned earlier, capacity is the empty
glass—it is how much a resource can hold, can provide, or is capable of. Like requirements,
capacity can take on many forms. A specific database server with a specific configuration has
the capability to provide a specific number of CPU cycles each second or a number of CPU
seconds each second. An IO subsystem has a specific capacity that can be quantified in terms
of IOPS or MB/s. It could also be further classified in terms of IO write capacity or IO read
capability. But regardless of the details, capacity is what a resource can provide.
Gathering CPU Capacity
We have already touched on gathering capacity data in a number of areas of this book, so I
will make this brief. The trick to quantifying capacity is defining both the unit of power and
the time interval. For example, the time interval may be a single hour, and the unit of power
may be 12 CPU cores. Combining the time interval and the power unit—12 CPU cores over a
1-hour period of time—we can say the database server can supply 720 minutes of CPU power
over a 1-hour period (12 CPUs × 60 minutes) or 43,200 seconds of CPU power over a 1-hour
period (12 CPUs × 60 minutes × 60 seconds / 1 minute).
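The 12-core example works out as a trivial sketch:

```python
# CPU capacity = duration x number of CPU cores.
cores = 12
duration_s = 60 * 60   # a 1-hour interval, in seconds

capacity_s = duration_s * cores
print(capacity_s)  # 43200 seconds of CPU power over the hour
```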
A very good way to quantify CPU capacity is based on the number of CPU cores.¹ The
number of CPU cores can be gathered from an operating system administrator. Additionally,
the v$osstat view is available with Oracle Database 10g and later versions. While obvious
in the particular Statspack report shown in Figure 9-6, Oracle does not always clearly label
the number of CPU cores. In this case, you can usually spot the value that represents the
number of CPU cores, but you should always double-check, because the number of CPU
cores is so important. It is used to calculate the database server’s CPU capacity and is a key
parameter when calculating capacity and utilization. Figure 9-6 indicates there are two CPUs,

¹ The number of CPUs or the number of CPU threads does not accurately reflect the processing power of an
Oracle-based system. There are many reasons for this, but in summary, Oracle’s multiple-process
architecture can take advantage of multiple cores, but not so much multiple threads per core or CPU.


but in reality, there is a single dual-core CPU providing two CPU cores’ worth of processing
power.

OS Statistics DB/Inst: PDXPROD/PDXPROD Snaps: 2625-2635


-> ordered by statistic type (CPU use, Virtual Memory), Name

Statistic Total
------------------------- ----------------------
BUSY_TIME 1,802,616
IDLE_TIME 0
SYS_TIME 96,260
USER_TIME 1,706,356
LOAD 93
OS_CPU_WAIT_TIME 69,269,700
VM_IN_BYTES 69,025,792
VM_OUT_BYTES 0
PHYSICAL_MEMORY_BYTES 17,171,005,440
NUM_CPUS 2

Figure 9-6. Shown is the operating system statistics portion of the same Statspack report
shown in Figures 9-1, 9-2, 9-3, and 9-5. This particular database server has a single dual-
core CPU.

CPU capacity can be defined as the duration multiplied by the number of CPU cores.

CPU capacity = duration × number of CPU cores

For example, as shown in Figure 9-2, the time interval is an unusual 149.97 minutes.
(Usually, Statspack reports are run for a single hour or two.) Therefore, the CPU capacity
based on this Statspack report is as follows:

299.94 min = 149.97 min × 2 CPU cores

Converting to seconds, this CPU subsystem can provide 17,996.40 seconds of CPU
capacity within the 149.97-minute interval.
Gathering IO Capacity
Unlike when I need to gather CPU capacity, if I must determine an IO subsystem’s capacity, I
ask the IO administrator. As detailed in the “Gathering IO Requirements” section earlier, we
have the information needed to determine Oracle’s IO requirements, but determining IO
capacity with authority is best done by the IO subsystem team. If your IO subsystem is simply
a series of SCSI drives daisy-chained together, as was done in the 1980s and early 1990s, then
simple math can be used to predict the IO subsystem’s capacity. However, the combination of
read and write caching and batching from both Oracle and the operating system virtually
eliminates the possibility of deriving a reliable IO capacity figure. Surely, we can gather and
even predict IO requirements, but predicting IO capacity is something I simply will no longer
attempt.
When talking with your IO administrator about capacity, ask for both read and write
capacity in either MB/s or IOPS. The IO administrator may not classify capacity into reads
and writes, but because Oracle systems have very distinct read and write characteristics,
just to be safe, it is always best to ask for both. Since we can gather IO
requirements in both MB/s and IOPS, it really doesn’t make much difference to us in which
form capacity is delivered or expressed.

Calculating Utilization
With both requirements and capacity defined and the data collection covered, let’s use them
together. The classic requirements-versus-capacity indicator is utilization. It can be applied in
a wide variety of situations—from river water flow and factory production to Oracle
performance analysis.
Oracle CPU Utilization
To calculate Oracle’s CPU utilization, we need Oracle’s CPU requirements (consumption)
and the operating system’s CPU capacity. Figure 9-3 provides the CPU requirements of
16,881.6 seconds of CPU and Figure 9-6 provides the capacity details of two CPU cores.
Figure 9-2 provides the sample interval necessary to complete the CPU capacity calculation.
We place these numbers into the utilization formula:

$$U = \frac{R}{C} = \frac{16{,}881.6\ \mathrm{s}}{2\ \mathrm{cores} \times 149.97\ \mathrm{m}} \times \frac{1\ \mathrm{m}}{60\ \mathrm{s}} = \frac{16{,}881.6}{17{,}996.4} = 0.938 = 94\%$$
This means that during the reporting interval, the Oracle instance consumed 94% of the
available CPU! This also means that only 6% CPU power remains for all other processes.
Obviously, this server is experiencing a severe CPU bottleneck.
In the preceding formula, I purposely included the conversion factors. Notice that all the
time units cancel out, leaving us with a raw numeric without any reference to time. I could
have carried the CPU core’s metric to the final value of 94%, since 94% of the CPU core
capacity was utilized, but since we normally don’t represent utilization like this, it could cause
some confusion.
It is important to understand that this calculation is not the operating system CPU utilization,
which can be gathered using an operating system command such as vmstat. What we
calculated with this utilization formula is Oracle’s CPU consumption related to the database
server CPU capacity. This is commonly called the Oracle CPU utilization.
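If you want to script this calculation, a minimal Python sketch follows. The function name and argument names are my own, not from the book; the numbers come from Figures 9-2, 9-3, and 9-6.

```python
def oracle_cpu_utilization(cpu_consumed_s, cores, interval_min):
    """Oracle CPU utilization: instance CPU consumption (seconds)
    divided by the server's CPU capacity over the same interval."""
    capacity_s = cores * interval_min * 60.0   # cores x minutes x 60 s/m
    return cpu_consumed_s / capacity_s

# 16,881.6 s of CPU consumed, 2 cores, 149.97-minute interval
u = oracle_cpu_utilization(16881.6, 2, 149.97)
print(f"{u:.3f}")  # 0.938, that is, about 94%
```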
Operating System CPU Utilization
Besides gathering operating system CPU utilization using standard operating system
commands such as vmstat and sar, starting with Oracle Database 10g, both operating
system CPU consumption and CPU capacity details are available through the v$osstat
view. We still need the sample interval, which is 149.97 minutes, as shown in Figure 9-2.
Database server CPU consumption is shown in the BUSY_TIME statistic in Figure 9-6
and represented in centiseconds. The BUSY_TIME statistic is the sum of any and all
database server process CPU consumption during the sample interval. Based on Figure 9-6,
during the reporting interval, all operating system processes consumed a total of 1,802,616
centiseconds of CPU.
Database server CPU capacity is calculated in the same manner as the Oracle CPU
utilization. Placing both the requirements and capacity into the utilization formula and
applying the appropriate conversion factors, we have this calculation:


$$U = \frac{R}{C} = \frac{1{,}802{,}616\ \mathrm{cs}}{2\ \mathrm{cores} \times 149.97\ \mathrm{m}} \times \frac{1\ \mathrm{s}}{100\ \mathrm{cs}} \times \frac{1\ \mathrm{m}}{60\ \mathrm{s}} = \frac{1{,}802{,}616}{1{,}799{,}640} = 1.00 = 100\%$$
This means the CPU subsystem is operating on average at 100% utilization! We should
have expected this, since Oracle is consuming 94% of all the available CPU. Calculating both
Oracle CPU utilization and the operating system CPU utilization, we have a very nice
confirmation that this is the only instance running on the database server and also that the
CPU subsystem is experiencing a raging bottleneck.
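The same arithmetic can be scripted for the v$osstat-based calculation. The sketch below assumes you have already differenced BUSY_TIME between two snapshots; the function name is my own.

```python
def os_cpu_utilization(busy_time_cs, cores, interval_min):
    """Operating system CPU utilization from the v$osstat BUSY_TIME
    delta (centiseconds) over a known sample interval."""
    busy_s = busy_time_cs / 100.0              # centiseconds -> seconds
    capacity_s = cores * interval_min * 60.0
    return busy_s / capacity_s

# 1,802,616 cs of busy time, 2 cores, 149.97-minute interval
u = os_cpu_utilization(1802616, 2, 149.97)
print(f"{u:.2f}")  # 1.00 -- the CPU subsystem is saturated
```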
The utilization formula implies a linear relationship between requirements and
utilization. In other words, if the requirements double, so will the utilization. When you are
asked, “But how do you know it really works like this?” show the graph in Figure 9-7, or
better yet, create one yourself. While no real Oracle system will match this perfectly, for
CPU-intensive systems, the linearity is very evident.
Figure 9-7 is an example of using SQL executions as the workload (logical IO would
have also worked very well). The solid line is the actual sample data plotted, and the dotted
line is a linear trend line added by Microsoft Excel. The correlation between the real data and
the trend line is 0.9328, which represents a very strong correlation! In the upcoming sections,
I will demonstrate how to use this linear relationship when anticipating the impact of a
firefighting solution.

Figure 9-7. Shown is a demonstration of the linear relationship between utilization and
requirements. This graph is not based on theory or mathematical formulas, but data sampled
from a CPU-intensive Oracle system. The solid line is based on the actual sample data, and
the dotted line is a linear trend line. Their correlation coefficient is a very strong 0.9328.

IO Utilization
Just as with CPU utilization, IO utilization can be calculated. However, because of the
possibility of file system buffer caching and IO subsystem caching, our Oracle-focused
utilization calculation is a worst-case scenario (no caching assumed), includes only the


instances we sample data from, and does not include any non-Oracle-related IO. So, the worth
of our calculation is limited (at best). The more valuable metric is Oracle's IO requirements,
which we calculated in a previous section.
The IO team can use Oracle’s IO requirements, apply whatever caching metric they
wish, and also add any other IO-related metrics. While our utilization calculation has limited
value, when comparing the theoretical worst-case utilization with the actual IO subsystem
utilization, it will demonstrate the effectiveness of caching, changing IO subsystem capacity,
and possibly various tuning efforts.
Suppose the IO administrator told you the IO subsystem has a capacity of 250 IOPS.
Earlier, in the “Gathering IO Requirements” section, we calculated that, during the reporting
interval, Oracle processes generate 5.59 IOPS. Once again, using the utilization formula, we
have this calculation:

$$U = \frac{R}{C} = \frac{5.59\ \mathrm{IOPS}}{250\ \mathrm{IOPS}} = 0.022 = 2.2\%$$
So, while the CPU subsystem is running at 100% utilization, if the IO subsystem is
receiving only this specific Oracle instance’s IO requests, and assuming there is no non-
Oracle caching, the IO subsystem would be running at around 2.2% utilization. It appears the
IO subsystem has plenty of capacity.

Oracle Service Time


Service time is how long it takes a single arrival to be served, excluding queue time. If the
arrival is defined to be an Oracle user call, then the service time may be something like 4.82
milliseconds per user call, or 4.82 ms/uc.
While the total service time includes all the time spent servicing transactions within a given
interval, the service time relates to a single arrival. The unit of time should
be in the numerator, and the unit of work should be in the denominator. Depending on your
data source, the information may be provided as work over time. Make sure to convert it to
time over work. If you forget to do this, any other calculation based on the service time (for
example, utilization and response time) will likely be incorrect.
Besides the general utilization formula presented in the previous sections, the classic
utilization formula is as follows:

$$U = \frac{S_t \lambda}{M}$$
where:
• St is the service time.
• λ is the arrival rate.
• M is the number of transaction servers, such as a CPU core.
It is important to understand the service time and arrival rate are independent and also
have a direct and linear relationship with utilization. Theoretically, when the arrival rate
increases, service time does not increase. What may increase, if the workload increases


enough, is the queue time. More practically, if it takes 10 ms to process one user call when the
system is lightly loaded, then it will continue to take 10 ms to process one user call when the
system is heavily loaded. This is why response-time curves are more or less flat until queuing
sets in. Remember that the users do not experience only service time, but the combination of
service and queue time.
As the utilization formula shows and as you might expect, if the arrival rate doubles, the
utilization will also double. In fact, as Figure 9-7 demonstrates, their relationship is not only
linear in theory and in CPU-intensive Oracle systems, it is linear in practice. This is also true
with the service time. For example, if we tune a SQL statement, resulting in a
more efficient execution plan and a 50% decrease in database server CPU consumption,
and nothing else changes, we can expect the utilization to also decrease by 50%. Now suppose
the CPUs were replaced, and the new CPUs can process an Oracle workload twice as fast. In
this case, and if nothing else changes, we would expect the utilization to also drop by 50%.
For precise forecasts, this formula will be slightly adjusted. But when anticipating and
evaluating alternative performance solutions, this works beautifully.
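The linear behavior just described can be checked with the classic formula itself. In this sketch (the values are from the example system in this chapter; the function name is mine), halving either the arrival rate or the service time halves the utilization:

```python
def utilization(service_time_s, arrival_rate, servers):
    """Classic utilization formula: U = St * lambda / M."""
    return service_time_s * arrival_rate / servers

st, lam, m = 0.00482, 414.9, 2          # 4.82 ms/uc, 414.9 uc/s, 2 cores
base = utilization(st, lam, m)          # ~1.00: saturated
half_load = utilization(st, lam / 2, m) # arrival rate halved
tuned = utilization(st / 2, lam, m)     # SQL tuning halves CPU per call
print(round(base, 2), round(half_load, 2), round(tuned, 2))  # 1.0 0.5 0.5
```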
We do not gather service time directly, but instead derive it from existing data. It turns
out that nearly always (and fortunately for us), we have parameters for all but the service
time. As an example, let’s use the data contained in Figures 9-2 and 9-6. Figure 9-2 shows
that during the Statspack interval, on average, 414.9 user calls were processed each second.
This will be our arrival rate: 414.9 uc/s. Based on Figure 9-6, we know the number of CPU
cores is two, and as calculated earlier, the operating system CPU utilization is 100%. Solving the
utilization formula for the service time, plugging in the numbers, and converting time to
milliseconds, we have the following calculation:

$$S_t = \frac{UM}{\lambda} = \frac{1.00 \times 2}{414.9\ \mathrm{uc/s}} = \frac{2.00}{414.9\ \mathrm{uc/s}} = 0.00482\ \mathrm{s/uc} \times \frac{1000\ \mathrm{ms}}{1\ \mathrm{s}} = 4.82\ \mathrm{ms/uc}$$
Notice if we are careful with the units, the service time naturally results in the unit of
time in the numerator and the unit of work in the denominator.
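Solving the utilization formula for St is equally easy to script. A small sketch using the figures above (function name mine):

```python
def service_time_ms(utilization, servers, arrival_rate_per_s):
    """Solve U = St * lambda / M for St; result in ms per unit of work."""
    st_s = utilization * servers / arrival_rate_per_s
    return st_s * 1000.0                       # seconds -> milliseconds

st = service_time_ms(1.00, 2, 414.9)
print(f"{st:.2f} ms/uc")  # 4.82 ms per user call
```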
Deriving the IO service time based on the utilization formula is fraught with problems
because of non-Oracle IO caching. Even more problematic is knowing the actual number of
active IO devices dedicated to an Oracle instance. But it gets worse. Having other IO activity
on a specific instance’s database files IO device further degrades the service time calculation
quality.
In summary, calculating IO service time is unreliable. The good news is that we are
more interested in IO response time, which is easy to collect, as discussed shortly.

Oracle Queue Time


When a transaction arrives into a system ready to be serviced, it may need to wait, or queue,
before servicing begins. Service time does not include wait time; that is, queue time. For
example, when an IO subsystem is 2.2% utilized, the entire IO processing time is virtually all
service time and no queue time.
Queue time can be calculated a number of ways. The simplest way, which is sufficient
for our purposes, is to subtract the service time from the total request time:

$$Q_t = R_t - S_t$$


The total request time is more formally called response time and is discussed in the next
section. The units for queue time are the same as for service time, such as milliseconds per
logical IO.
Looking at the classic utilization formula, you see that utilization can increase if the
service time, the arrival rate, or both increase. As the classic utilization formula indicates, and
as Figure 9-7 demonstrates occurs in CPU-intensive Oracle systems with a consistent
workload mix, when utilization increases, it is because the arrival rate (the workload) is
increasing, not because service time is increasing.
Surely, the total service time is increasing with each arrival, but the service time per
arrival (called simply the service time) remains roughly the same. It is easy to get the two
terms confused. We know that when the workload increases, the total CPU consumption
increases. But the CPU consumption increase is accompanied by a matching workload increase. The two
offset each other, keeping the per-arrival service time the same while the utilization increases.
When CPU-intensive Oracle algorithms begin to break down, you will notice that
service time starts to increase as the arrival rate increases. If you look ahead to Figure 9-9,
you will see a slight upward slope in the service time. As discussed back in Chapter 3, the CPU-intensive Oracle latch acquisition algorithm, with its combination of spinning and sleeping,
does a tremendous job of limiting the increase in service time as the workload increases. What
you feel and what users feel when performance begins to degrade is probably queue time
increasing, rather than service time increasing.
In Chapter 4, we covered how CPU and IO subsystems are fundamentally different from
a queuing perspective. The central difference is there is only one CPU queue, but each IO
device has its own queue, so transactions have no choice but to read or write to a given IO
device, regardless of its queue size. This can result in a busy device with a massive queue,
while another device has little or no queue time. As a result, CPU subsystems with multiple
cores exhibit little queue time until utilization reaches around 70%, whereas IO
subsystems immediately exhibit queue time. As I detailed in Chapter 4, this is true even for
perfectly balanced IO subsystems.
Figure 9-8 contrasts an eight-device IO subsystem (solid line) and an eight-CPU core
subsystem (dotted line) having the exact same service time. We know their service times are
the same because, at a minimal arrival rate when no queuing occurs, their response time is
exactly the same. With the understanding that service time does not change, regardless of the
arrival rate, we know that any increase in the response time is due to queue time.
Figure 9-8 shows that a CPU subsystem can maintain a fairly consistent response time
until it reaches near capacity.2 This means little or no queuing exists until the arrival rate
significantly increases. In contrast, IO subsystems start queuing immediately, as reflected in
the upward-sloping response-time curve.

2. This is true for multicore CPU subsystems. The greater the number of CPU cores, the flatter the response-time curve and the steeper the elbow of the curve.


Figure 9-8. Shown is the classic response-time curve contrasting an eight-device IO
subsystem (solid line) and an eight-CPU core subsystem (dotted line). Even with a perfectly
balanced IO subsystem, without advanced algorithms and a significant amount of IO caching,
IO requests nearly always contain significant amounts of queue time.
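The contrast in Figure 9-8 can be reproduced with standard queuing theory. The sketch below is my own illustration, not the book's data: it compares an M/M/8 system (one shared queue, like a CPU subsystem) against eight independent M/M/1 queues (one per IO device, load split evenly), both with the same 10 ms service time, using the Erlang C formula and assuming exponential arrivals and service.

```python
from math import factorial

def mm_m_response(lam, mu, m):
    """M/M/m response time using Erlang C: one queue shared by m servers."""
    a = lam / mu                               # offered load in erlangs
    rho = a / m                                # per-server utilization
    p0_sum = sum(a**k / factorial(k) for k in range(m))
    last = a**m / factorial(m)
    erlang_c = last / ((1 - rho) * p0_sum + last)  # P(arrival must queue)
    return 1/mu + erlang_c / (m*mu - lam)      # service time + queue time

def per_device_response(lam, mu, m):
    """m independent M/M/1 queues, arrivals split evenly across devices."""
    return 1 / (mu - lam/m)

mu = 100.0                                     # 100/s per server: St = 10 ms
for lam in (200.0, 400.0, 600.0):              # total arrivals per second
    cpu_like = mm_m_response(lam, mu, 8)
    io_like = per_device_response(lam, mu, 8)
    print(f"rho={lam/(8*mu):.2f}  cpu={cpu_like*1000:.1f} ms  io={io_like*1000:.1f} ms")
```

Even at 75% utilization the shared-queue response time has barely moved off the 10 ms service time, while each per-device queue's response time has grown to four times the service time, matching the curve shapes in Figure 9-8.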

Oracle Response Time


From the previous chapters, you know that response time equals service time plus queue time.
In fact, at the highest level, our Oracle firefighting methodology is based on classifying time
in response time’s two foundational categories: service time and queue time. Not only does
this allow a very systematic diagnostic approach, but it also provides a wonderful and natural
bridge between firefighting and predicting the impact of our possible solutions. Before we
move into quantitatively anticipating our solution’s impact, some additional details about
service time, queue time, and response time specifically related to Oracle systems need to be
covered.

The Bridge Between Firefighting and Predictive Analysis


When performing an Oracle response-time analysis (ORTA), we place Oracle server and
background process time into the classic queuing theory buckets: service time and queue time.
Keeping in mind that all Oracle server and background processes are either consuming CPU
or posting a wait event,3 as I’ll detail in the following sections, we naturally transform their
CPU time into service time and their non-idle wait time into queue time. This creates a bridge,
or link, between firefighting and predictive analysis. This bridge is supported by standard
queuing theory mathematical formulas, some already presented in earlier sections, which we
will use to quantify the anticipated results of our firefighting solutions.
Once I present a few more foundational elements, in addition to Figure 9-7, I will
demonstrate how Oracle systems do, in fact, operate in a manner that follows queuing theory,

3. Oracle does not guarantee that all system calls are instrumented. As a result, there can be missing time. Also, Oracle CPU consumption includes Oracle processes waiting for CPU and also queuing for CPU. As a result, in a CPU-saturated system, Oracle may report CPU consumption higher than actually occurred.


and by performing an ORTA, we can indeed anticipate our proposed solution’s effect. And
this does not apply only to Oracle-centric solutions, but also to application-focused and
operating system-focused solutions.

Total Time and Time Per Workload


When performing an ORTA, we gather all of a category’s time within a sample interval. For
example, consider the data presented in Table 9-1. This hypothetical data was gathered during
a 1-hour interval, during which Oracle server and background processes consumed (required)
50 seconds of CPU time. We will place this 50 seconds of CPU consumed into the service
time category. During this 1-hour interval, Oracle processes completed 20,000 block changes
and 10,000 SQL executions. These are two metrics commonly used to represent the total
workload. The block change service time is therefore 0.00250 s/bc, which is the total service
time divided by the total block change workload (0.00250 = 50 / 20000).
Table 9-1 also details the total queue time and the queue time for a single arrival—that is, per unit
of work. The point is, as previously stated, there is a difference between the total service time
and the service time, and also between the total queue time and the queue time. In addition,
we can interject potentially useful and relevant arrival rate metrics, such as block changes,
SQL executions, redo entries, or logical IO. Selecting a useful workload
metric is discussed in the “Response-Time Graph Construction” section later in this chapter.

Table 9-1. Relationships between time components over a 1-hour interval

Time Category     Totals     Time (sec) per    Time (sec) per
                             Block Change      SQL Exec
Response time     555 sec    0.02775           0.0555
Service time       50 sec    0.00250           0.0050
Queue time        505 sec    0.02525           0.0505
  IO time         500 sec    0.02500           0.0500
  Other time        5 sec    0.00025           0.0005

Workload
Block changes     20,000
SQL executions    10,000
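The per-unit columns are a straight division of each category's total time by the workload count. A sketch (my own naming) that reproduces them:

```python
def per_unit(total_times_s, workload_count):
    """Convert each category's total time into time per unit of work."""
    return {cat: t / workload_count for cat, t in total_times_s.items()}

totals = {"response": 555.0, "service": 50.0, "queue": 505.0,
          "io": 500.0, "other": 5.0}           # seconds over 1 hour
per_block_change = per_unit(totals, 20000)     # 20,000 block changes
per_sql_exec = per_unit(totals, 10000)         # 10,000 SQL executions
print(per_block_change["service"], per_sql_exec["queue"])  # 0.0025 0.0505
```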

CPU Service and Queue Time


Back in Chapter 5, I mentioned Oracle has a limited perspective in classifying CPU time
based on when a transaction is being serviced by a CPU or when it is waiting in the CPU run
queue. In other words, Oracle does not have the ability to split CPU time into service time and
queue time. When we gather CPU consumption from either the instance statistics views or the
system time model views, what we collect as CPU time and typically classify as CPU service
time actually contains both CPU service time and CPU queue time. Oracle has no way of
knowing the difference and reports the total, which is, in the truest sense, CPU response time.


This type of perspective is common. In fact, computing systems are actually composed
of a series of interconnected queuing systems. This is called a networked queuing system. An
Oracle transaction does not really enter a single large system, wait to be serviced, get serviced,
and then exit to return the result. It enters a complex series of queuing systems, moving from
one system to the next, each with the possibility of queuing and then servicing the transaction.
When the transaction exits the complete system, the sum of all the service time and the sum of
all the queue times are presented as simply the service time and the queue time. So, this type
of abstraction is very common.
This abstraction is not a problem for three additional reasons:
• It allows our performance analysis to move forward without insignificant details
getting in the way of the problem at hand. Our goal should always be to keep
situations as simple as practically possible. Added complexity and precision take
effort and resources that should not be expended unless absolutely necessary.
• All time is accounted for; that is, time is not lost or unaccounted for. It is simply
classified in an abstracted and summarized format.
• Significant queuing begins to occur near the elbow of the curve, which happens
between 80% and 90% utilization, depending on the number of CPU cores. When evaluating
alternative firefighting solutions, we want to be nowhere near the elbow of the curve!
When an Oracle database server CPU subsystem is heavily utilized, we know
performance will be significantly degraded. Knowing the degree of “badness” is not important
when evaluating alternative firefighting solutions. So, when performing our ORTA, it is not a
problem to abstract and simply call this value CPU service time.

IO Service and Queue Time


When IO times are gathered using Oracle’s wait interface, from Oracle’s vantage point, it is
actually more of an IO response time. When Oracle issues an IO request to the operating
system, it waits until the IO request is satisfied. When the IO subsystem processes the IO
request, there is service time (perhaps transferring the data) and queue time (perhaps disk
latency and head movement). The gettimeofday system call Oracle issues does not
distinguish IO service and queue time, and therefore Oracle has no way of knowing the
classification. But just as with CPU time, this does not present a problem, primarily because
as performance analysts, we are interested in how long an IO call takes. The time components
of the call can be of interest, but it’s the total time—the response time—we need to know.
When we perform an ORTA, we classify all IO time as a subclassification of queue time.
This may seem like an unfortunate and desperate abstraction, but it actually fits perfectly from
a database perspective. If no IO occurs, Oracle satisfies all requests consuming only CPU
time. But as the workload increases and some of this work requires IO, response time begins
to increase; that is, queue time begins to increase; that is, IO time begins to increase. The
pattern fits very nicely into an ORTA.
In summary, both our CPU service time and IO queue time abstractions fit very nicely
into an ORTA, providing us with the opportunity to apply predictive mathematics to evaluate
alternative firefighting solutions.


Oracle Response Time in Reality


The classic response-time curve in Figure 9-8 highlights the differences between CPU and IO
subsystems. It turns out that real Oracle systems operate somewhere between the two. The
dotted line in Figure 9-8 represents an Oracle system that operates completely and only with
CPU. In other words, there is no physical IO, only logical IO activity. In contrast, the solid
line represents an IO-centric Oracle system. No Oracle system can operate with only IO,
because there must be CPU resources consumed to run processes, which includes processing
the IO once it has been read from disk.
Figure 9-9 graphically shows a system with an intense logical IO load, which consumes
virtually no physical IO resources. While you can see the classic response-time curve, it is not
nearly as nice and neat as the mathematics would have us believe. But this is the reality of the
situation, and as they say, it is what it is. In all fairness, the graph would have looked more
like a theoretical response-time curve if I gathered more samples at each workload (and
plotted the averages) and increased the sample time from 120 seconds to perhaps 360 seconds
or an hour. But I wanted you to see that even with limited data, the CPU subsystem does
exhibit queuing theory characteristics. Every system and every load will produce a different
graph, but from an abstracted view, they will have similarities, and we will use these to
anticipate the impacts of our possible firefighting solutions.

Figure 9-9. Shown is an actual response-time curve based on a heavily CPU-loaded Linux
Oracle Database 10g Release 2 system with a four-CPU core subsystem. The dotted line is
the service time (CPU), and the solid line is the response time (CPU plus all non-idle wait
time), with the difference between the two being queue time (non-idle wait time). The initial
large jump in queue time occurred at 75% utilization, and the last data point occurred at 98%
utilization.

The arrival rate in Figure 9-9, which is the horizontal axis, is simply the number of
logical IOs (v$sysstat: db block gets plus consistent gets) processed per
millisecond. The service time was calculated by dividing the total service time
(v$sys_time_model: DB CPU plus background cpu time) by the total number of


logical IOs. The queue time was calculated by dividing all non-idle wait time by the number
of logical IOs. From a mathematical perspective, the data collection interval is irrelevant as
long as all the data is gathered during the same interval. But if you are curious, the sample
interval was 120 seconds.
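Each plotted point in Figure 9-9 boils down to three divisions. Here is a sketch of the per-sample calculation; the 120-second sample values are hypothetical, and the function name is mine:

```python
def orta_point(interval_s, logical_ios, db_cpu_s, bg_cpu_s, nonidle_wait_s):
    """One response-time-curve point from interval deltas: arrival rate
    (lio/ms) plus service, queue, and response time (ms/lio)."""
    lam = logical_ios / (interval_s * 1000.0)           # logical IOs per ms
    st = (db_cpu_s + bg_cpu_s) * 1000.0 / logical_ios   # CPU ms per lio
    qt = nonidle_wait_s * 1000.0 / logical_ios          # wait ms per lio
    return lam, st, qt, st + qt

# Hypothetical 120 s sample: 6,000,000 logical IOs, 95 s of DB CPU,
# 5 s of background CPU, and 40 s of non-idle wait time.
lam, st, qt, rt = orta_point(120, 6_000_000, 95.0, 5.0, 40.0)
print(f"{lam:.0f} lio/ms  St={st:.4f}  Qt={qt:.4f}  Rt={rt:.4f} ms/lio")
```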
Figure 9-10 graphically shows a system with an intense physical read IO load. Because
the system is experiencing a heavy physical IO load, the response-time curve is likely to
correlate with physical IO-related statistics. For this figure, I chose the instance statistic
physical reads.4 The service time metric is the sum of the time model statistics DB
CPU for server process CPU time and background cpu time for the background
process CPU time.5 The queue time consists of all non-idle wait event time. With only these
simple time classifications, the graph in Figure 9-10 was created. As you’ll see later in the
chapter, we can use graphs like this to anticipate our solution’s impact.

Figure 9-10. Shown is an actual response-time curve based on a heavily read IO-loaded
Linux Oracle Database 10g Release 1 system. The dotted line is the service time (CPU), and
the solid line is the response time (CPU plus all non-idle wait time), with the difference
between the two being queue time (non-idle wait time). The unit of work is physical Oracle
blocks read, which is the instance statistic physical reads. The initial large jump in
queue time occurred when IO read concurrency (wait event, read by other session)
suddenly appeared and eventually became about one-third of all the non-idle wait time.

There is so much to be gleaned from this single graph. While the top wait event was db
file scattered read, notice the queue time for each arrival (it’s the difference
between the response time and the service time lines). Before queuing really sets in, the queue
time (not service time or response time) is around 0.01 ms! This means that while the
requested blocks were not in Oracle’s buffer cache, they were in some other cache. The

4. The instance statistic physical reads signifies the number of Oracle blocks that Oracle server processes had to request from the operating system because the blocks did not reside in Oracle's buffer cache.
5. I don't mean to insult your intelligence.


operating system was able to provide Oracle a requested block, on average, in around 0.01
ms.
While these details are not shown, when queuing took a second dramatic increase at
around an arrival rate of 7E+06 pio/ms, it wasn't because the physical devices became busy.
They averaged only around 3% utilization. Queue time took this big increase because CPU
utilization reached around 80%. Because IO requests were being satisfied primarily with CPU
resources, from an Oracle performance perspective, IO response time was based on CPU
utilization! Since we can see that Figure 9-10 does exhibit response-time curve characteristics,
in this particular situation, we can use a CPU queuing theory model to anticipate Oracle IO
read times.
However, this is a very specific situation (though not as unusual as most people think),
in which the CPU is used heavily to satisfy IO requests. The majority of Oracle systems have
their IO requests satisfied from a dynamic mix of physical disk IO and caching. As a result,
with an IO-intensive system, our queuing theory mathematics will need to be more IO-focused
than CPU-focused.
Another challenge when anticipating IO response time is visually demonstrated by the
initial response-time jump around an arrival rate of 4.5E+06 pio/ms. This occurred not
because the IO subsystem, or even the CPU subsystem, reached capacity. It occurred
because of concurrency issues! This jump in queuing time occurred when the server processes
started asking for the same database blocks to be brought into the cache at nearly the same
time. Eventually, this concurrency issue accounted for about 30% of the total queue time. This
is an Oracle Database 10g Release 1 system, and the concurrency wait event is read by
other session. In earlier Oracle versions, the wait event would have been buffer
busy waits.
Did you notice the initial drop (not increase) in the response time? This occurs in Oracle
systems because cache efficiencies (Oracle and the operating system) increase as the
workload begins to increase. For IO subsystems, I have seen response-time curves (based on
real data) that look like a smiling face because of the significant cache efficiency effect. This
is what we want to see! Eventually, however, as the workload increases, some component in
the system will reach its capacity limit (in Figure 9-10, it was the CPU subsystem and
concurrency issues), and the classic response-time curve elbow will appear.
Now that I’ve detailed how to collect data and plot the actual response-time graph, it’s
time to move on to creating a response-time graph that is more general and suitable for
anticipating the impact of a firefighting solution.

Response-Time Graph Construction


This is where the real fun—and also the real risk—begins. The moment you draw a picture of
your system, all eyes will be focused on you. Your objective is to convey the situation as
simply as possible, without misleading anyone. Simplicity and abstraction are your friends.
The moment you attempt to be precise or get heavily into the mathematics, you’re doomed.
This book is not about predictive performance analysis, and this is not our focus here either.
Our goals are to convey the situation and anticipate the general effect of our proposed
solutions. Providing more information promotes better decisions about which solutions to
implement and in what order.
While the examples used in this section are based on an entire Oracle instance activity,
everything described can also be applied to a single session or a group of sessions. For


example, instead of gathering CPU consumption and wait time from v$sysstat,
v$sys_time_model, and v$system_event, when focusing on a particular session or
group of sessions, use v$sesstat, v$ses_time_model, and v$session_event.
Obviously, to calculate operating system utilization, the v$osstat view will have to be
used. But a session’s or group of session’s contribution to the utilization can be calculated in
the same way as the Oracle instance CPU utilization (which is simply called Oracle CPU
utilization).

Selecting the Unit of Work


When creating a response-time graph representing a real system, it is important to use an
appropriate unit of work. For your graph to provide value—mimic and show any relation to
reality—it must use a unit of work that relates to the queue time issue. For example, as Table
9-2 shows, if the bottleneck is CPU, logical IO processing will most likely correlate very
well with CPU consumption. If the bottleneck is IO, the number of SQL executions, the
number of block changes, or the number of physical block reads may correlate very well with
the IO activity. A good unit of work, when increased, will push the response time into the
elbow of the curve.

Table 9-2. Selecting a unit of work based on the bottleneck

Bottleneck    Focus Area    Instance Statistic
CPU           Logical IO    db block gets + consistent gets, session logical reads
              Latching      v$latch gets, misses, sleeps
              Parsing       parse count (hard), parse count (total)
IO Read       Physical IO   physical reads, physical read requests
IO Write      DML           db block changes, redo writes, redo bytes
Concurrency   Locking       enqueue requests, enqueue waits
              Commits       user commits, rows/commit
Network       Transfers     SQL*Net roundtrips to/from client, SQL*Net roundtrips to/from dblink
Memory        SQL Sorting   sorts (memory), sorts (rows), v$system_event direct path write temp

If you have multiple samples (for example, you are running reports, pulling from the
AWR tables), you will know if a good unit of work has been chosen because the resulting
graph will look somewhat like a response-time curve. As Figures 9-9 and 9-10 demonstrate, it
won’t be perfect, but it should have an elbow in the curve.
A good unit of work will also help you identify the high-impact SQL that deserves
attention, forging a strong link between Oracle, the application, and the operating system. For
example, if there is a CPU bottleneck with the top wait event related to CBC latch contention,


then we would normally look for the top CPU-consuming SQL and the top logical IO SQL
(shown as Buffer Gets or simply Gets in AWR and Statspack). If we select logical IO as
our unit of work, we are likely to get a good response-time graph, and because the graph’s
arrival rate is based on logical IOs, we can naturally present how, by identifying and tuning
the high logical IO SQL, we will move out of the elbow of the curve. So, picking a good unit
of work is more than a technical exercise. It is also relevant in communication, performance
improvement strategy, and anticipation of the impact of the proposed solution.

Choosing the Level of Abstraction


Just as when you are asked what you do for a living and start with, “I work in IT,” when
initially and graphically conveying the performance situation using a response-time graph,
start at a very abstract level. Obviously, this is particularly important when presenting to a
nontechnical audience.
First, consider if numbers must be displayed. Showing numbers can lead to detailed
discussions that may not be necessary and can be distracting. If you show numbers, be ready
to answer questions like, “What is a user call and how does that relate to performance?” If
you don’t want to answer this because it is a clear distraction to your objectives, then do not
show these numbers.
I am not advocating misleading or misrepresenting the situation. I am advocating
appropriate abstraction and simplification to get the job done. You can drive down to the
details, but don’t go there unless it becomes necessary.
Figure 9-11 is an example of a very high-level response-time graph. With a graph like
this, you can help your audience understand three fundamental facts:
• The response-time graph is a very abstracted perspective of what users
are experiencing. Make sure they understand that as the workload increases, so does poor
performance. And the objective is to get the system out of the elbow of the curve.
Most people inherently understand (even though they may not be able to articulate it)
being in the elbow of the curve is bad and being to the left of the elbow is good. It
then follows that your solutions will somehow move the system out of the elbow of
the curve.
• The workload is so intense it is pushing performance degradation very high.
Highlight the vertical bar in the elbow of the curve (I even included an arrow in
Figure 9-11), so there is no doubt your audience understands the workload is much
too large. Everyone will naturally know that one solution is to reduce the workload.
• Dramatic performance degradation results when operating in the elbow of the curve,
which means even seemingly small workload changes can bring about dramatic
performance changes. This can be very frustrating to users who crave consistency
and dependability.


Figure 9-11. Shown is a highly abstracted response-time graph with minimal information. It
is used to convey the performance situation as unacceptable and clearly in the elbow of the
curve. People seem to intuitively know that being in the elbow of the curve is a bad thing.

If you feel it is necessary, then show the numbers. Be ready to explain them, how they
relate to performance, and how your solutions will alter the situation.
Figure 9-12 was created using the same data as the graph in Figure 9-11. The only
difference is that I included the numbers and used standard words (for example, “response
time”) and metrics (for example, “exec” for executions and “ms” for milliseconds). If asked
why the SQL execution metric is relevant, I may respond that there is a CPU bottleneck, and
in this system, the number of SQL statement executions directly impacts CPU consumption,
which affects the response time. As I’ll detail in later sections, you can also state that your
proposed solutions are aimed at reducing the execution rate and the impact of each execution.
After you have shown and described an abstracted response-time curve like the one in
Figure 9-11 or Figure 9-12, if your audience members are technical and will benefit from
seeing real data, and you have multiple samples, then show them a graph containing real data,
like the ones in Figure 9-9 and Figure 9-10. If you do show real data, be very well prepared to
keep control of the presentation, because you will be peppered with questions, many of which
will be irrelevant.


Figure 9-12. Shown is the same data as in Figure 9-11, but with slightly less abstraction.
Notice I use more traditional words, such as “response time” and “arrival rate,” and include
numeric values. If you include technical words and numbers, be prepared to explain what
they mean and how they relate to the performance situation.

As I’ll detail in the next section, using basic queuing theory math, you can construct a
graph similar to Figure 9-11 or Figure 9-12 with only a single peak time 1-hour interval
sample (for example, from Statspack or AWR).

The Five-Step Response-Time Graph Creation Process


To help you get started creating a response-time graph, I created a five-step process. You can
use this process regardless of the database server bottleneck and even if you have a single
sample or hundreds. Enjoy!

Know the System Bottleneck


If the database server is the bottleneck, then the database server bottleneck will be either CPU,
IO, or some lock/blocking (for example, enqueues) issue. Your graph will reflect either the
general queue time increase of an IO bottleneck or the steep and dramatic elbow of a CPU
bottleneck. Figure 9-8 is a good guide, as it contrasts both the CPU and IO bottlenecks.
Based on v$osstat data shown in Figure 9-6 and the reporting interval shown in
Figure 9-2, we calculated in the subsequent sections the server is running at 100% CPU
utilization. While the wait event situation is not shown, the Statspack report shows the top
wait event is clearly CBC latch contention. Based on the instance CPU consumption data
shown in Figure 9-3, the reporting interval shown in Figure 9-2, and the CPU core number
shown in Figure 9-6, we calculated an Oracle CPU utilization of 94%. Clearly, there is a CPU
bottleneck.


Pick an Appropriate Unit of Work


When you choose an appropriate unit of work, the response-time graph will be a good
representation of the real system. This will make presenting the graph very natural and
understandable, and will naturally lead into your performance solutions discussion.
Following our example of a raging CPU bottleneck, we will use logical IOs as our unit
of work. Logical IOs consist of all buffer touches. Oracle distinguishes current mode buffer
touches by the db block gets instance activity statistic, and the consistent mode buffer
touches are signified by the consistent gets statistic. These two statistics will be
combined to produce a single logical IO statistic. Based on the instance statistics shown in
Figure 9-2, 1,307,632,010 logical IOs occurred during the Statspack reporting interval.

Determine the Service Time and Queue Time


As detailed in the previous sections, for each of your samples (perhaps a single sample or
hundreds), get the sample interval time, total CPU consumption (total service time), total non-
idle wait time (total queue time), and workload for your selected unit of work (total arrivals).
Then for each sample, calculate the arrival rate, service time, and queue time.
Continuing with our example, Figure 9-2 shows the sample interval to be 149.97
minutes in which the logical IO value (sum of db block gets and consistent gets)
is 1,307,632,010. Here is the arrival rate math:

λlio = lio / time = (1,307,632,010 lio / 149.97 m) × (1 m / 60 s) × (1 s / 1000 ms) = 145.32 lio/ms
Based on Figure 9-3, the total service time is 16,881.6 seconds, or 16,881,600
milliseconds. Determine the service time by dividing the total service time by the unit of work
value. Here is the service time math:

St = St:tot / λwork:tot = 16,881,600 ms / 1,307,632,010 lio = 0.0129 ms/lio

Determine the queue time by dividing the total queue time by the unit of work value. For
Oracle systems, the total queue time is all the non-idle wait time that occurred during the sample
interval. Most Statspack and AWR reports have a Top 5 Timed Events section near the top of
their reports. This is simply the top four most time-consuming wait events and also the CPU
time. Usually the top four wait events account for 90% or more of all the non-idle wait time.
For our required level of precision, we can simply sum the wait time for the top four wait
events. While the details are not shown, their combined wait time is 45,672 seconds during
the sample interval. Here is the queue time math:

Qt = Qt:tot / λwork:tot = (45,672 s / 1,307,632,010 lio) × (1000 ms / 1 s) = 0.035 ms/lio
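The three calculations above are easy to reproduce in a few lines of arithmetic. The following sketch uses the sample values from this example (the interval, the logical IO total, the CPU time, and the top-wait-event time); the variable names are mine, not Oracle's.

```python
# Queuing inputs taken from the Statspack sample used in this example.
interval_ms = 149.97 * 60 * 1000      # 149.97 minute sample interval, in ms
lio_total   = 1_307_632_010           # db block gets + consistent gets
st_total_ms = 16_881_600              # total CPU consumption (total service time)
qt_total_ms = 45_672 * 1000           # top four wait events (total queue time)

arrival_rate = lio_total / interval_ms   # lambda, in lio/ms
service_time = st_total_ms / lio_total   # St, in ms/lio
queue_time   = qt_total_ms / lio_total   # Qt, in ms/lio

print(f"lambda = {arrival_rate:.2f} lio/ms")   # ~145.32
print(f"St     = {service_time:.4f} ms/lio")   # ~0.0129
print(f"Qt     = {queue_time:.4f} ms/lio")     # ~0.0349
```

The same arithmetic applies to any unit of work; only the workload total changes.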


If Possible, Compare Utilizations


For CPU subsystems, you can compare the actual CPU utilization (perhaps gathered from
v$osstat or vmstat) with the classic utilization formula. If you picked a good unit of
work, the difference should be within 15%. If the bottleneck is the IO subsystem, because of
caching and batching, utilization comparison may be interesting, but it is unlikely to closely
match or provide much value.
Continuing with our example, the actual CPU utilization, based on the v$osstat
statistics shown in Figure 9-6, is 100%. To derive the CPU utilization, enter the calculated
arrival rate, service time, and number of CPU cores (also shown in Figure 9-6) as follows:

U = (St × λ) / M = (0.0129 ms/lio × 145.32 lio/ms) / 2 cores = 1.87 / 2 = 0.94 = 94%
As we hoped, we are within 10%. If we are not within 15%, we can still use our graph
for informational purposes, but for numerically quantifying and anticipating our solution’s
impact (as described later), it will not be reliable.
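This cross-check is simple to script. The sketch below recomputes the derived utilization from the values above and compares it to the observed 100%; the 15% threshold is the rule of thumb from the text.

```python
# Values from this chapter's CPU-bottleneck example.
service_time = 16_881_600 / 1_307_632_010            # St, ms/lio
arrival_rate = 1_307_632_010 / (149.97 * 60 * 1000)  # lambda, lio/ms
cores = 2

derived_u  = service_time * arrival_rate / cores     # classic U = St * lambda / M
observed_u = 1.00                                    # actual utilization from v$osstat

print(f"derived U = {derived_u:.0%}")                # ~94%
if abs(observed_u - derived_u) > 0.15:
    print("warning: the chosen unit of work may be a poor fit")
```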

Create the Response-Time Graph


It is finally time to introduce the basic response-time graph formula. The following is the
general response-time formula:

Rt = St + Qt

Here is the response time formula for CPU focused systems:

Rt:CPU = St + Qt = St / (1 − U^M) = St / (1 − (St × λ / M)^M)

Here is the response time formula for IO focused systems:

Rt:IO = St + Qt = St / (1 − U) = St / (1 − St × λ / M)

The first equation above simply states that response time is the sum of service time and
queue time. The second and third equations show the CPU and IO response time formulas,
respectively, first in terms of the utilization symbol and then with the expanded utilization
formula, which can be handy if the utilization is unknown.6
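Both formulas are easy to encode. The sketch below implements them directly; the essential behavior (response time equals service time at negligible load, and rises sharply as utilization nears 100%) falls out of the math.

```python
def rt_cpu(st, lam, m):
    """CPU-centric response time: Rt = St / (1 - U^M), where U = St*lam/m."""
    u = st * lam / m
    return st / (1 - u ** m)

def rt_io(st, lam, m):
    """IO-centric response time: Rt = St / (1 - U), where U = St*lam/m."""
    u = st * lam / m
    return st / (1 - u)

# CPU example from this chapter: St = 0.0129 ms/lio on a 2-core server.
print(rt_cpu(0.0129, 145.32, 2))   # deep in the elbow: Rt far above St
print(rt_cpu(0.0129, 1.0, 2))      # nearly idle: Rt is essentially St
```

Sweeping `lam` from near zero toward saturation and plotting `rt_cpu` or `rt_io` produces the response-time curves discussed throughout this section.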
Figure 9-13 is the response-time graph based on the Statspack report used throughout
this chapter and on which many of the preceding values and calculations are based (including

6
There is a more precise response-time formula based on Mr. Agner Krarup Erlang’s (1878-1929) famous
ErlangC formula used to study telephone networks. For details on this formula, see Forecasting Oracle Performance
(Apress, 2007), Chapter 3, “Increasing Forecast Precision.”


the queuing theory calculations shown in this section). The actual graph was constructed by
inputting the core statistics of sample interval time, total workload units, total CPU
consumption, number of CPU cores, and total non-idle wait time into OraPub’s response-time
graph template (a Microsoft Excel-based tool).7

Figure 9-13. Shown is the response-time graph created using OraPub’s response-time graph
template. This particular graph is based on the data shown in the calculations in this section.
The peak arrival rate is clearly beyond what the system can process, and as we would expect,
severe performance problems are occurring.

Notice the peak arrival rate is deep in the elbow of the curve. The system is so busy and
the peak arrival rate intersects the response-time curve so high up that it dwarfs the service
time. It shouldn’t take much effort to convince your audience of this dire situation, preparing
them to embrace your solutions about how to get out of the elbow of the curve.
Before we embark on anticipating our performance solution’s impact, let’s look at
another example.

A Response-Time Curve for an IO-Bottlenecked System


The previous example was based on a two-CPU core system experiencing a raging CPU
bottleneck. Here, we will walk through the same process, but with a larger system
experiencing a classic multiblock IO read bottleneck.
This example is based on a real Oracle Database 10g Release 2 system with four CPU
cores. The performance data is based on a standard 60-minute interval AWR report. We will
complete each of the five steps outlined in the previous sections, resulting in the response-
time graph.

Know the System Bottleneck


The database server bottleneck is the IO subsystem. Simply put, Oracle’s IO read
requirements have exceeded the IO subsystem’s read capacity. We will expect our response-

7
This tool is available for free from OraPub’s web site. Locate it by searching for “firefighting.”


time graph to reflect the general queue time increase of an IO bottleneck, which has a
continual and steady increase in queue time until the elbow of the curve is reached, and then
response time skyrockets.
While the wait event situation, shown in Figure 9-14, looks like an IO bottleneck,
especially with the 20 ms average db file scattered read time, there could also
be a CPU bottleneck. To double-check, calculate both the operating system and the Oracle
CPU utilization. Using Figure 9-15 to calculate Oracle CPU requirements and considering
both server process (DB CPU of 2,065.87 seconds) and background process (background
cpu time of 25.95 seconds) CPU consumption, the total instance CPU consumption is
2,091.82 seconds.8 The database server CPU capacity is based on the four CPU cores and the
60-minute reporting interval.

Figure 9-14. Shown is a snippet from both v$system_event (wait events) and
v$sys_time_model (DB CPU). Oracle does not include background CPU when
calculating “CPU time.”

The following is the Oracle CPU utilization calculation:

U = R / C = 2,091.82 s / (4 cores × 59.31 m × 60 s/m) = 2,091.82 / 14,234.40 = 0.147 = 15%

8
When the Statspack and AWR reports base CPU consumption on the v$sys_time_model view, they
incorrectly include only server process CPU consumption (DB CPU) in the Top 5 Timed Event report, and do not
include any background process CPU consumption (background cpu time). Notice the DB CPU time shown in
Figure 9-15 matches the CPU Time statistic shown in Figure 9-14.


Oracle processes are consuming only 15% of the available CPU capacity. Unless there is
another instance or other processes consuming CPU, we would expect the operating system
CPU utilization to be around 1% to 10% higher than the Oracle CPU utilization. While
not shown in a figure, the v$osstat BUSY_TIME statistic is 228,056 cs, which means all
operating system processes during the 60-minute interval consumed 2,280.56 seconds of
CPU. Placing the CPU consumption (requirement) value into the utilization formula, we see
the operating system CPU utilization is only 16%. So at this low CPU utilization, the
operating system overhead is minimal.

U = R / C = 228,056 cs / (4 cores × 59.31 m × 60 s/m × 100 cs/s) = 228,056 / 1,423,440 = 0.160 = 16%
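Both utilization figures come from the same requirements-over-capacity formula; only the units differ. Here is a sketch using this example's numbers (a 59.31-minute interval on four cores); the variable names are mine.

```python
# CPU capacity for the AWR sample: 4 cores over a 59.31 minute interval.
interval_s = 59.31 * 60
cores = 4
capacity_s = cores * interval_s          # total CPU seconds available

oracle_cpu_s = 2091.82                   # DB CPU + background cpu time, seconds
os_busy_s    = 228_056 / 100             # v$osstat BUSY_TIME is in centiseconds

oracle_u = oracle_cpu_s / capacity_s     # Oracle instance CPU utilization
os_u     = os_busy_s / capacity_s        # operating system CPU utilization

print(f"Oracle CPU utilization: {oracle_u:.0%}")   # ~15%
print(f"OS CPU utilization:     {os_u:.0%}")       # ~16%
```

The small gap between the two figures (about one percentage point here) is the non-Oracle CPU overhead discussed in the text.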

Figure 9-15. Shown is an AWR report snippet from the v$sys_time_model used in this
exercise. The total CPU consumption (service time) during the 60-minute interval is 2,091.82
seconds. This includes both server process (DB CPU) and background process
(background cpu time) CPU consumption.

Clearly, there is no CPU bottleneck. Combined with the wait event report
snippet shown in Figure 9-14, we can see there is an IO bottleneck.


Pick an Appropriate Unit of Work


Because it’s obvious that there is an IO read bottleneck, the number of server process IO read
requests should be a good unit of work. This is the instance statistic (v$sysstat)
physical read IO requests. Oracle tracks physical IO by SQL statement, allowing
our response-time mathematics, the response-time curve, our performance-improving
strategy, and communication to be easily understood and well founded. If there were an IO
write bottleneck, the instance statistic db block changes would be another good unit of
work candidate.

Determine the Service Time and Queue Time


The AWR Instance Statistics section showed the physical read IO requests
(breads for short) statistic to be 148,439. Before we calculate the service and queue times, the
arrival rate based on our unit of work needs to be calculated. Here is the arrival rate math:

λ = breads / time = (148,439 breads / 59.31 m) × (1 m / 60 s) × (1 s / 1000 ms) = 148,439 / 3,558,600 = 0.042 breads/ms
Based on Figure 9-14, the total service time is 2,091.82 seconds, or 2,091,820
milliseconds. Determine the service time by dividing the total service time by the unit of work
value. Here is the service time math:

St = St:tot / λtot:breads = 2,091,820 ms / 148,439 breads = 14.09 ms/bread

Determine the queue time by dividing the total queue time by the unit of work value.
While the detailed wait event listing is not shown, even by looking at the top four wait events
shown in Figure 9-14, we can infer these account for 90% or more of all the non-idle wait
time. For the necessary level of precision, I simply added the wait times for the top four wait
events. Their combined wait time is 1,867 seconds during the sample interval. Here is the
queue time math:

Qt = Qt:tot / λtot:breads = (1,867 s / 148,439 breads) × (1000 ms / 1 s) = 12.58 ms/bread

If Possible, Compare Utilizations


Since this system is undergoing an IO bottleneck, computing the IO utilization will not add
much value and may actually cause more unimportant questions to be asked (creating
unnecessary distractions).

Create the Response-Time Graph


Creating the IO response-time graph is a little tricky because you never see response time
solved for the number of devices (M). Looking at the IO focused response-time equation
below, we know every variable except for M. We know and have calculated above the


response time’s core components, service time and queue time. Here is the core IO centric
response time formula:

Rt:IO = St + Qt = St / (1 − St × λ / M)

Solving the IO focused response time formula for M:9

M = St × λ × (St + Qt) / Qt
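The derived M can be checked numerically. The sketch below recomputes this example's arrival rate, service time, and queue time from the AWR values, solves for M, and then verifies that plugging M back into the IO response-time formula reproduces St + Qt. The variable names are mine.

```python
# AWR sample values from this IO-bottleneck example.
interval_ms = 59.31 * 60 * 1000
breads      = 148_439                # physical read IO requests
st_total_ms = 2_091_820              # total CPU consumption, ms
qt_total_ms = 1_867 * 1000           # top four wait events, ms

lam = breads / interval_ms           # ~0.042 breads/ms
st  = st_total_ms / breads           # ~14.09 ms/bread
qt  = qt_total_ms / breads           # ~12.58 ms/bread

# Solve Rt = St / (1 - St*lam/M) for M:
m = st * lam * (st + qt) / qt        # effective number of IO "devices"

# Sanity check: the formula must reproduce St + Qt at this arrival rate.
rt = st / (1 - st * lam / m)
print(f"M  = {m:.2f}")               # ~1.25
print(f"Rt = {rt:.2f} ms/bread")     # ~26.67, i.e. St + Qt
```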

Based on this exercise’s AWR data, the various values gathered and derived were
entered into OraPub’s response-time graph template tool, resulting in the graph shown in
Figure 9-16.10 The tool is very simple to use and requires only the data presented in this
example.

Figure 9-16. Shown is the response-time graph for this exercise. It shows the service time,
queue time, response time, and the arrival rate as reported from the AWR report.

With the key queuing theory calculations performed and the response-time graph
created, we are nearly ready to move on to anticipating the performance improvement impact
of our various solutions. However, before we get to that topic, it is important to understand
the ways we can alter the users’ experience.

9
To check the math, go to http://wolframalpha.com and enter, s+q=(s/(1-(s l/m))), solve m
10
To create graphs highlighting your area of interest, it is often helpful to set the maximum values for the x
axis and y axis. This can be done by right-clicking on the axis, clicking on the scale tab, and manually entering the
maximum axis value.


How to Improve the Performance Situation


When it comes to improving performance, the bottom line is to get out of the elbow of the
curve. As I have mentioned, when presenting a response-time curve, even nontechnical
audiences quickly grasp that being in the elbow is “bad” and being out of the elbow is “good.”
Use this intuitiveness to demonstrate—even at a very high level—your performance-
improving strategies. This will build confidence in your solutions and also help more
effectively rank them.

Tuning: Reducing Requirements


Tuning Oracle, the application, or the operating system effectively reduces its requirements.
For example, instead of a SQL statement consuming 5 seconds of CPU, it now consumes only
2 seconds of CPU. Thinking about the basic utilization formula of requirements divided by
capacity, if requirements decrease and capacity remains the same, then the utilization must
decrease. The only way to increase the utilization once again is to increase the requirements.
One way to do this is to increase the workload; that is, the arrival rate. So, through tuning, we
have provided the basic performance-enhancing options of decreased response time, increased
throughput, or some combination of both.
From a queuing theory perspective, what really happens when service time drops is that
a new response-time curve takes effect. Because the service time decreases, with no load on
the system and therefore no queuing, the response time with minimal arrivals is less. So, the
curve has shifted down. But it gets better. Because each transaction server (for example, a
CPU core) can process each arrival quicker, it can process more arrivals per unit of time
before queuing sets in, which shifts the graph to the right. So, tuning shifts the response-time
curve down and also to the right!
Figure 9-17 graphically shows how tuning can affect a system. Starting at point A, the
performance is unacceptable and highly variable. By tuning the application, Oracle, or the
operating system, the response time decreases (that is, improves), and the system is operating
at point B. However, now the administrators have a choice. By controlling the workload (the
arrival rate), they can allow more work to flow through the system without affecting the response
time. Point C shows this negligible effect on response time by allowing the arrival rate to
increase. So again, tuning provides the performance analyst with several options: decreased
response time, increased workload, or a managed combination of both!
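The down-and-to-the-right shift can be seen numerically. In this hypothetical sketch, tuning cuts the service time from 0.0129 to 0.0100 ms/lio on a two-core system; the arrival rate and post-tuning service time are assumed values of mine, chosen only for illustration.

```python
def rt_cpu(st, lam, m):
    """CPU-centric response time: Rt = St / (1 - (St*lam/m)^m)."""
    u = st * lam / m
    return st / (1 - u ** m)

cores = 2
lam = 140.0                          # current arrival rate, lio/ms (hypothetical)

before = rt_cpu(0.0129, lam, cores)  # pre-tuning service time
after  = rt_cpu(0.0100, lam, cores)  # post-tuning service time (assumed)

print(f"Rt before tuning: {before:.4f} ms/lio")
print(f"Rt after tuning:  {after:.4f} ms/lio")   # lower: curve shifted down and right
```

At the same arrival rate the tuned system queues far less, which is exactly the move from point A to point B in Figure 9-17.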


Figure 9-17. Shown is the response time effect of tuning. By tuning, a new response-time
curve takes effect (dotted line), and the response time drops from point A to point B. By
controlling the workload, performance can remain at point B or by allowing the workload to
increase to point C, the system can still maintain both improved response time and an
increased workload.

Buying: Increasing Capacity


When additional or faster CPUs or IO devices are added to a system, we have effectively
increased capacity. For example, because the old CPUs were replaced with CPUs that are
twice as fast, instead of a SQL statement consuming 4 seconds of CPU, it now consumes only
2 seconds of CPU. Or perhaps six additional CPU cores were added. Thinking about the basic
utilization formula of requirements divided by capacity, if capacity increases and the
requirements remain the same, then the utilization must decrease. The only way to increase
the utilization is to increase the requirements. One way to do this is to increase the workload
(the arrival rate). So, by increasing capacity, we have provided the basic performance-
enhancing options of decreased response time, increased throughput, or some combination of
both.
From a queuing theory perspective, what really happens when capacity is added depends
on whether additional transaction processors (think more CPU cores) are implemented or the
transaction processors are faster (think faster CPUs)—or if we’re lucky, both. If the
transaction processors are faster, the service time decreases with the same effect as with
tuning. We can expect a new response time curve similar to the one shown in Figure 9-17 to
take effect. However, if we add transaction processors with no change to service time, the
response-time curve does not shift down. But because there are more transaction processors
available, as a whole, they can process more transactions per unit of time, which shifts the
curve to the right, allowing for an increase in the arrival rate before queuing sets in.
Figure 9-18 graphically shows how adding more transaction processors can affect a
system. If the bottleneck is IO, then the same general effect occurs when adding IO devices.
Starting at point A, the performance is unacceptable and highly variable. By implementing
additional transaction processors, the response time decreases (that is, improves), and the


system is operating at point B. However, now the administrators have a choice. By controlling
the workload (the arrival rate), they can allow more work to flow through the system without
affecting response time. Point C shows this negligible effect on response time by allowing the
arrival rate to increase. So, by adding more capacity (either more and/or faster transaction
processors), the performance analyst once again has several options: decreased response time,
increased workload, or a managed combination of both.
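The effect of adding transaction processors can be sketched the same way: the service time is unchanged, but with more servers each is less utilized, so the curve shifts right without shifting down. The arrival rate below is a hypothetical value of mine.

```python
def rt_cpu(st, lam, m):
    """CPU-centric response time: Rt = St / (1 - (St*lam/m)^m)."""
    u = st * lam / m
    return st / (1 - u ** m)

st  = 0.0129               # ms/lio; unchanged by adding cores
lam = 140.0                # arrival rate, lio/ms (hypothetical)

two_cores  = rt_cpu(st, lam, 2)
four_cores = rt_cpu(st, lam, 4)

# At negligible load both configurations give Rt = St: no downward shift.
print(f"Rt with 2 cores: {two_cores:.4f} ms/lio")
print(f"Rt with 4 cores: {four_cores:.4f} ms/lio")   # less queuing, same St
```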

Figure 9-18. Shown is the response-time effect of increasing capacity by adding transaction
processors (for example, CPU cores). By adding CPU cores, a new response-time curve takes
effect (dotted line). The response time drops from point A to point B. By controlling the
workload, performance can remain at point B, or by allowing the workload to increase to
point C, the system can still maintain both improved response time and an increased
workload.

Balance: Managing Workload


Workload management can provide arguably the most elegant of all performance
improvements. And of all the missed performance-improving opportunities, I would say better
workload management has got to be near the top. While shifting workloads may not be a very
satisfying technical challenge (though it can be), when the workload is better managed, peak
workload and painful performance periods can be dramatically improved. And all this can
occur without tuning Oracle, the application, or the operating system, and without any capital
investment.
Suppose around time 13 in Figure 9-19 is when users are extremely upset. It’s not time
23, because the users are asleep and the batch jobs are running just fine. The performance
analyst must determine what is occurring—that is, the workload mix—during time 13 and
work with the user community to shift a segment of that workload, perhaps to time 15. While
this may seem unlikely, when confronted with a severe performance problem, a graphic
clearly showing the situation (for example, Figure 9-19), users can be surprisingly flexible.
But if they are told to change the way they work without understanding why, they will most
likely rebuff any attempt to alter the workload.


Figure 9-19. Shown is a workload graph, which appears to have ample workload-balancing
opportunities. By moving some of the workload during painful peak processing time to
nonpeak processing times, the workload requirements during peak times are effectively
reduced. This decreases response time, allows for increased workload of a specified type, or
some combination of both.

In some cases, the users may not even be aware of the workload shift. For example,
during a consulting engagement, I noticed a messaging process woke every 30 seconds to
check for messages to transfer and then performed the transfer. I discovered even with only a
few messages, there was a tremendous amount of overhead. I asked the application
administrator (not the end users) if the message process could wake up every 5 minutes just
during the peak processing times. To my surprise, he willingly embraced this rather elegant,
zero downtime, and zero capital investment performance-improving solution.
From a queuing theory perspective, when the workload is better balanced, the arrival
rate is reduced. Figure 9-20 graphically shows that when the arrival rate is decreased, we move
from point A to point B. When the arrival rate is decreased, system requirements decrease,
resulting in a utilization decrease, as well as a response time decrease. Unlike with the tune
and buy options, there is no response-time curve shift. What has shifted is the workload; that
is, the system has traveled along the response-time curve. This is usually difficult to initially
understand. But consider that when the workload has decreased, there is no change in the
service time, as it takes a transaction processor just as long to process a single transaction as
before. Therefore, the response-time curve does not shift down. The response-time curve does
not shift to the right because no additional capacity has been implemented. What changed is
the arrival rate, so we simply move along the existing response-time curve as the response
time decreases. As we move to the left, while service time remains the same, the queue time
will decrease, resulting in an improved response time.
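Moving along the curve can also be sketched. Using the arrival rates from Figure 9-20's caption (24 trx/ms reduced to 18 trx/ms), with a service time and server count I have assumed so that 24 trx/ms sits in the elbow:

```python
def rt_cpu(st, lam, m):
    """CPU-centric response time: Rt = St / (1 - (St*lam/m)^m)."""
    u = st * lam / m
    return st / (1 - u ** m)

st, m = 0.1, 3             # assumed: 0.1 ms/trx service time, 3 servers

peak     = rt_cpu(st, 24.0, m)   # point A: 24 trx/ms, U = 80%
balanced = rt_cpu(st, 18.0, m)   # point B: 18 trx/ms, U = 60%

# Same curve (same St, same m): only the position on the curve changed.
print(f"Rt at 24 trx/ms: {peak:.3f} ms/trx")
print(f"Rt at 18 trx/ms: {balanced:.3f} ms/trx")
```

Because neither St nor m changed, the curve itself is identical in both calls; only the arrival rate, and therefore the queue time, differs.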


Figure 9-20. Shown is the effect of workload balancing on response time. During peak
processing time, the response time is at point A. By shifting some of the workload to another
time, the arrival rate is reduced from 24 trx/ms to 18 trx/ms, resulting in a significant
response time reduction. Notice the response-time curve does not shift, but rather the system
activity has shifted.

Anticipating Solution Impact


Now, we finally answer the question posed at the beginning of the chapter: What kind of
performance improvement can we expect? I’ll start with some direction and words of caution.
Then I will move directly into a series of exercises to demonstrate step by step how you can
creatively apply all the material presented in this book to anticipate the impact of a solution
and improve the situation.

Simplification Is the Key to Understanding


With our performance-improving objective in mind, we will continue to responsibly simplify
and to use abstraction. Simplification is the key to understanding, and to communicating
technical concepts and information. We want to make our performance presentation
memorable. We want to motivate change.
By making the numbers and the concepts simple, your audience will quickly understand
you and be able to draw the same striking conclusions as you have. Purposely and deliberately
be imprecise, while not misleading or being incorrect. For example, unless it is absolutely
necessary, do not show a number like 3.14156; instead, show 3.
If you are successful, your audience will come away with the same conviction you have
in implementing the solutions. In addition to this, the decision makers will have useful
information they understand, allowing them to determine what solutions should be
implemented, in what order and when, and by which group.
In summary, to be understood, simplify.


A Word of Caution
As we move more fully into anticipating change, which is a gentler term for forecasting,
predictive analysis, and capacity planning, be very cautious. The concepts and techniques I
have presented in this book so far, and what remains, are not meant for deep predictive
analysis. As I continue to state, our objective is to anticipate the impact of solutions. Use
general and imprecise words and numbers to convey change, movement, and direction.
The main reason the forecasting techniques presented in this book are not up to
predictive analytics snuff is because they are not validated and may be based on a single
sample. The math is fine, but I have purposely not brought you through the steps to create a
robust forecast model and then to validate the model so you can understand its precision and
usefulness. For our objectives, which are very general and imprecise, the increased precision
and complexity are not necessary. If you desire more precision, a number of technical papers,
books, and some training opportunities are available.11

Full Performance Analysis: Cycle 1


Here’s the situation: Users are angry, very angry. The Oracle Database 10g Release 2 online
system residing on a single four-CPU core Linux server is taking much too long when simply
querying for basic information, like a customer. But it’s not just one, two, or three queries—
it’s the entire system. You have been assigned to diagnose the system, recommend solutions,
and clearly explain the solutions, including reasonable expectations for their impact on
performance.
The worksheets shown in the figures in the remainder of this chapter are all contained
within a single Microsoft Excel workbook, the firefighting diagnostic template workbook.12
Only the yellow cells require input. If you enter information into a cell that isn’t shaded
yellow, you are typing over a formula! Data entered in one of the worksheets may be
referenced in another worksheet. For example, the Oracle CPU consumption shown in the
Oracle-focused worksheet (Figure 9-21) is referenced in the operating system CPU analysis
worksheet (Figure 9-22). All the data-entry fields have been previously discussed. If you need
more information about their source, please review the appropriate section in this book.

Oracle Analysis
To summarize, the Oracle subsystem is being forced to ask for blocks outside its cache. While
the operating system returns these blocks extremely fast, the number of requests results in a
significant portion of the total response time. From a purely Oracle perspective, we can easily
reduce the queue time by 20% by simply increasing Oracle’s buffer cache.
Figure 9-21 provides the core Oracle diagnostic information collected over a 30-minute
interval in a response-time analysis format. At this point in the book, you should know the
service time CPU information came from v$sys_time_model and the queue time
information came from v$system_event. All wait events that consumed more than 5% of
the wait time during the reporting interval are included in this analysis and shown in Figure 9-21.
Clearly, the top wait event is db file scattered read, yet the average wait time
is only 0.093 ms! So, it’s the classic situation where the requested blocks are not in Oracle’s

11. These are listed on the OraPub web site. To find them, review OraPub’s training schedule and/or do a search for “forecast.”
12. This tool can be downloaded for free from OraPub’s web site. Just do an OraPub search for “firefighting template.”


buffer cache, but the operating system retrieves them very quickly. If the system were
bottlenecked, we would expect to find a raging CPU bottleneck. Instead, the sheer number
of buffers Oracle must process, combined with the CPU speed, is resulting in unacceptable
performance.

Figure 9-21. Shown is the ORTA information entered into a firefighting diagnostic template,
which makes diagnosing, analyzing, and anticipating change impact much simpler. Clearly,
db file scattered read events are the issue. While the CPU subsystem capacity is
not shown, Oracle is consuming only 26% of the available CPU resources.

Calculating Averages
I’ve included one new piece of information in Figure 9-21. Notice that column K is the
average type (Avg Type). Two different types of average calculations are shown in Figure 9-
21: straight and weighted.
While the straight average for the IO read wait times is not shown in the figure, it is
calculated as follows:

Avg = (0.093 + 0.107 + 0.010) / 3 = 0.210 / 3 = 0.070
However, the scattered read wait time of 0.093 ms occurs much more frequently than the
other waits. A more accurate average calculation would take into account the occurrences of
the scattered read waits in addition to its average event wait time. This is called a weighted
average, and it is a much better average calculation when working with diverse and highly
variant data sets, as we have here.
The average calculation, weighted by the total wait time (which includes the weight
occurrences) is shown in Figure 9-21. It is calculated as follows:

WA = ((754.650 × 0.093) + (5.410 × 0.107) + (4.520 × 0.010)) / (754.650 + 5.410 + 4.520) = 0.093

If you think about it, the weighted concept makes sense. Because the scattered read
waits happen so much more often, the average IO read time should reflect this and be pulled
toward the scattered read wait times. As Figure 9-21 shows, the weighted average value
actually rounds to the average scattered read wait time of 0.093. While the difference may
seem insignificant, not only can this have a dramatic effect when anticipating the impact of a
performance solution, but it also makes the averages more realistic.
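Both calculations are easy to sketch using the three IO read wait figures from Figure 9-21 (each pair is a per-event total wait time and average wait). Only the scattered read event is named in this excerpt, so the other two event labels are placeholders.

```python
# (total wait in s, avg wait in ms) for the three IO read events (Figure 9-21)
waits = [
    (754.650, 0.093),   # db file scattered read
    (5.410,   0.107),   # second read event (name not given in this excerpt)
    (4.520,   0.010),   # third read event (name not given in this excerpt)
]

# Straight average: each event counts equally, regardless of how often it occurs.
straight = sum(avg for _, avg in waits) / len(waits)

# Weighted average: each event's average wait is weighted by its total wait
# time, so the frequent scattered reads pull the result toward 0.093 ms.
weighted = sum(w * avg for w, avg in waits) / sum(w for w, _ in waits)

print(f"straight = {straight:.3f} ms, weighted = {weighted:.3f} ms")
```

The straight average comes out to 0.070 ms, while the weighted average rounds to 0.093 ms, matching the scattered read wait time as described above.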
Reducing Queue Time
Our Oracle-focused solutions will concentrate on service time, queue time, or both. One
solution to reduce the scattered read waits is to increase Oracle’s buffer cache. There is plenty
of memory, and (as shown shortly) there is also plenty of CPU available to handle the
possible increase in cache management resources. Based on the size of the tables involved, a
1GB buffer cache should be able to cache the entire customer table.
Because the total queue time accounts for nearly 30% (28.9%) of the total response time,
if queue time is eliminated, total response time could improve by as much as 30%. But there
will likely be some other queue time, so to be conservative, let’s say we anticipate a 20%
decrease in queue time.
Reducing Service Time
Total service time accounts for nearly 70% of the sample interval’s total response time.
Clearly, there is an opportunity here for improvement. How we reduce the service time may not
be so easy in practice. While there are possibilities to reduce service time from both an
operating system and an application perspective, from an Oracle perspective, a
straightforward tweak is not apparent. This is not a problem because of the potentially
massive performance improvement achieved by increasing the buffer cache to reduce the total
queue time and also by tuning key CPU-intensive SQL statements (as explained in the
discussion of the next analysis cycles).

Operating System Analysis


To summarize, the operating system is not experiencing a shortage of capacity in the classic
sense. The Oracle system is predominantly consuming CPU resources, yet due to Oracle and
operating system scalability limitations, the operating system CPU is only 28% utilized. As a
result, an Oracle server process is primarily bound by CPU speed, which translates into
service time.
Figure 9-22 provides the metrics for our operating system investigation. While not
shown, neither the network nor the memory subsystem is an issue.


Figure 9-22. Shown is the operating system analysis information entered into a firefighting
diagnostic template. There is no CPU or IO bottleneck. Oracle is consuming 26.1% of the
CPU resources, and based on vmstat observations and v$osstat data, CPU utilization is
around 28%. The IO subsystem is responding to IO requests in less than 1 ms! All IO data
was gathered from the Oracle v$sysstat performance view.

CPU Subsystem
The CPU subsystem consists of a single four-CPU core. Based on both vmstat observations
and v$osstat view data, on average, the CPUs are about 28% utilized. Over the 30-minute
data collection interval, the CPU subsystem has the capacity to supply 7,203 seconds of CPU.
Based on the Oracle service time analysis, Figure 9-21 shows Oracle consumed about 1,880
seconds of CPU, meaning Oracle consumed about 26% of the available CPU resources.
From a CPU subsystem perspective, it is not possible to increase scalability by somehow
splitting a single Oracle server process activity onto multiple CPU cores. Our only option for
decreasing total service time is to use faster CPUs. Because of cost and budgetary timing issues,
we do not want to entertain this option unless absolutely necessary. So at this point, we will
not seek to improve performance by increasing the CPU subsystem capacity.
IO Subsystem
Based solely on the v$sysstat performance view, the IO subsystem is receiving read
requests at nearly 530 MB/s. Oracle read requests (db file scattered read) are
being satisfied in less than 1 ms, which indicates the Oracle blocks reside in the operating
system buffer cache! While not shown in Figure 9-22, the average IO device utilization is
around 2%, meaning they are idle.


The Oracle and application tuning strategies are intended to reduce the number of IO
read calls, making an increase in IO activity and subsequent IO performance issues highly
unlikely.

Application Subsystem
To summarize, by reducing both physical and logical block activity, performance can be
significantly improved. This means SQL tuning and/or reducing SQL statement execution
rates. Figure 9-23 shows there are three high-consuming physical IO SQL statements, with the
top statement consuming more than twice as much physical IO as the second and third ones
combined! Figure 9-24 shows the system is processing nearly 70 logical IOs each
millisecond.

Figure 9-23. Shown is the essential application SQL information entered (obviously copied and
pasted) into a template. All the information was gathered during the 30-minute collection
interval from v$sql and represents only what was processed during the collection interval.
Notice the most resource-consuming statement is not the slowest and consumes no more
resources per execution than other statements. It’s the combination of execution rate and per-
execution resource consumption that makes it stand out.

The Oracle analysis has directed us to the most important application SQL, which is
SQL needing blocks that do not currently reside in Oracle’s buffer cache. By focusing on
SQL with the highest physical IO consumption, we can significantly reduce the application
impact. There is no guessing or gut feeling about this. It is a fact. However, we expect the
Oracle-focused solution of increasing the buffer cache to have a profound impact, and the
change requires only a single parameter adjustment and an instance cycle. We will want to
reanalyze the situation during the second analysis cycle. So at this point in the analysis, we
will wait before suggesting any application changes.
Figure 9-24 shows common workload metrics we will combine with our Oracle and
operating system analysis when building our response-time graphs and anticipating change.
Figure 9-24 also provides two distinct informational aspects: the workload metrics in both
seconds and milliseconds, and response-time-classified details. It provides these details by
calculating the appropriate resource consumed (for example, CPU consumption) divided by
the workload metric activity during the reporting interval. For example, each logical block
processed consumed 0.01507 ms—that is, 0.01507 ms/lio. This is the logical IO service time
and can be useful when constructing a response-time curve based on logical IO activity.


Figure 9-24. Shown is the workload diagnostic information. Notice only the total interval
workload values and the interval (sample) time require entry. The workload information,
combined with the ORTA, provides a plethora of diagnostic data we will use when
anticipating performance solution impact.

Response-Time Graphs
Our ORTA shows response time can be significantly reduced by focusing on both physical
block IO (queue time) and logical block IO (service time). To more clearly convey the
situation and help others come to the same performance-enhancing conclusions as we have,
we will create two response-time graphs: one focused on the CPU subsystem and the other on
the IO subsystem.

Figure 9-25. Shown is a response-time graph created using OraPub’s response-time graph
template based solely on data shown in this example’s related figures. This response-time
graph focuses on the CPU subsystem, so we chose logical IO as our workload metric. As
expected, the system is not operating in the elbow of the curve. Since there is virtually no
queue time per logical IO processed, improving performance will be the result of decreasing
service time by influencing the optimizer to choose a better execution plan, tuning Oracle
to be more efficient, or using faster CPUs.

Figure 9-25 shows the response-time graph based on logical IO processing during our
reporting interval. The response-time graphs for this example were created as described

previously in this chapter, using OraPub’s response-time graph template. The logical IO
workload metric was chosen, since it typically has a high correlation to CPU consumption.
Because there is virtually no queue time related to processing a logical IO, to reduce the
LIO response time, the service time will need to be decreased. The trick to reducing service time
is to figure out a way for each logical IO to consume less CPU. There are many ways to do
this: using faster CPUs, tuning Oracle to be more efficient, or influencing the optimizer to
choose a more efficient execution plan. During the second analysis cycle, we will focus on
this tuning approach.

Figure 9-26. This response-time graph focuses on the IO subsystem, so we chose physical IO
as our workload metric. IO subsystems nearly always exhibit some queue time, and this
situation is no different. Physical IO requests do include a significant amount of queue time,
so we have multiple ways to reduce the physical IO-related response time. However, on this
system, physical IOs are satisfied so quickly that the best course of action is to simply
eliminate them by increasing the buffer cache.

Figure 9-26 shows the response-time graph based on physical IO processing during our
reporting interval. The physical IO workload metric was chosen because it typically has a
high correlation to IO requests and directly relates to our application analysis.
As expected, there is significant queue time involved with our IO requests. As presented
previously, there are multiple ways to reduce the queue time and also the service time. One of
our performance-improving strategies is to virtually eliminate all physical IO requests,
essentially changing the arrival rate to zero. While the service time theoretically will not
change, because the number of physical IO requests will be drastically reduced, Oracle will
not need to spend so much time placing blocks into the buffer cache. This effectively reduces
the CPU time spent per logical IO processed, resulting in a reduction in service time. It will be
interesting to see what actually occurs!

What-If Analysis
Now let’s combine our recommendations with response-time mathematics to anticipate
change. The second analysis cycle will show the actual effect of our changes!


To summarize this exercise’s performance situation, the online users are experiencing
poor performance due to Oracle being required to retrieve blocks from the IO subsystem and
then process them. There is plenty of CPU, IO, and memory capacity. It just needs to be
shifted to maximize performance. The planned shifts are to increase Oracle’s buffer cache to
virtually eliminate all physical IO requests and to tune the most CPU-consuming SQL
statement. Both changes will have a dramatic performance improvement impact.
It is always best and more reliable to focus on one change at a time. As anyone working
in IT has experienced, multiple simultaneous changes can have unanticipated effects. We
need to know the impact of each change. Therefore, only one change will be implemented at a
time.
Because it’s the easiest to implement and should result in a significant performance
improvement, we will increase the buffer cache first. Increasing the buffer cache to 1GB will
effectively cache the customer table, resulting in virtually no physical IO requests. Figure
9-26 shows that even if the physical IO arrival rate drops to nearly zero, the resulting per-request
response time will not be significantly reduced. So our goal is to eliminate, as best we can, the actual
physical IO requests. Figure 9-21 shows physical IO accounts for nearly 30% of the total
Oracle response time. By eliminating nearly all the physical IO, assuming the users don’t
perform more work and we don’t hit some locking type of performance issue (for example,
row-level locking), physical IO intensive process performance will improve by 30%.
Users who unknowingly run multiple queries at the touch of a button, waiting for the
application to return control to them and getting upset, are highly likely to feel the effect of
the Oracle in-memory queries! Figure 9-24 shows that both logical and physical IO response
times are around 0.022 ms. In Figure 9-23, notice that all of the top ten statements have nearly
the same number of physical IO and logical IO activity. This means that by eliminating the
physical IO—that is, ensuring the tables are completely cached—we would expect the elapsed
time to drop by half. Figure 9-23 shows the top SQL statement average elapsed time is 0.632
sec/exec. By increasing the Oracle buffer cache, average elapsed time is likely to drop to
perhaps 0.316 sec/exec (0.632/2).
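The halving estimate follows directly from the figures above: the top statements do about as many physical IOs as logical IOs, and both cost roughly the same per IO, so eliminating the physical half should roughly halve elapsed time. A minimal sketch of that arithmetic:

```python
elapsed_sec_per_exec = 0.632   # top statement's elapsed time (Figure 9-23)
physical_share = 0.5           # pio count ~= lio count, with similar per-IO cost

# Removing the physical IO portion leaves only the logical IO portion.
expected = elapsed_sec_per_exec * (1 - physical_share)
print(f"expected elapsed time ~ {expected:.3f} sec/exec")
```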
With an increase in performance (decrease in response time), users may perform work
more rapidly, increasing the workload. The more time users are sitting and waiting for the
application to return control to them, the more of a workload increase we can expect to see. It
is possible a significant increase in the workload could offset the gain in response time. But
either way, the users win. If they don’t increase the workload, online performance increases.
If they do significantly increase the workload, they will get more work done!
When making a statement like this, someone is likely to ask how much more throughput
can be expected. That’s when it’s time to once again show the response-time graph in Figure
9-25, which illustrates the current situation in a pure logical IO (CPU) perspective. If the
service time does not decrease (this is detailed in the second analysis cycle), then it appears
we can nearly double the workload before response time significantly increases. Unless you
have a way to control the user workload or understand the application very well, there may be
no way of knowing if the users can or will increase the workload. But regardless, you have
graphically and with simplified numbers demonstrated the performance impact of increasing
the buffer cache.
To see what actually occurred when the buffer cache was increased to 1GB, read on!


Full Performance Analysis: Cycle 2


As described in the previous section, an Oracle buffer cache increase was chosen as the first
performance-enhancing change. We are anticipating around a 30% decrease in total Oracle
response time, and for users who run multiple serial queries at a single touch to feel around a
50% decrease in response time. We are not sure if users will be able to take advantage of
faster response time and get more work done, but it won’t surprise us if they can.

Oracle Analysis
As Figure 9-27 shows and as we expected, physical IO has been virtually eliminated. The
total service time has also decreased. We were hoping for a 30% drop in total response time.
But when comparing the Oracle analysis shown in Figure 9-21 to Figure 9-27, we can see the
total response time decreased from 2,644 seconds to 1,257 seconds, which is a 52%
improvement! (A direct comparison was possible because the collection interval was the
same: 30 minutes.) The large drop in service time is due to less cache management related to
placing blocks into the buffer cache. The decrease in Oracle’s CPU consumption should result
in a drop in CPU utilization. Any further performance gain should now focus on reducing
CPU consumption, which is squarely focused on heavy logical IO SQL.

Figure 9-27. Shown is the 30-minute interval ORTA as a result of the buffer cache increase.
Comparing this to Figure 9-21, as expected, total queue time has been virtually eliminated.
Total service time has also decreased due to less buffer cache management related to placing
buffers into the cache.

Operating System Analysis


Figure 9-28 shows the operating system is looking even better than before! The CPU
utilization dropped from 28% to 20%, and the IO subsystem is receiving from Oracle less
than 1 MB/s in reads and writes. So, it appears that increasing the Oracle buffer cache has had
a very positive effect on resource consumption.


Figure 9-28. Shown is the operating system analysis information. Compared to Figure 9-22,
Oracle CPU consumption dropped to 17%, and the operating system CPU utilization dropped
from 28% to 20%. Since the utilization significantly dropped, we should not expect a large
increase in the workload.

At this point, the only way to decrease CPU-related response time is to either use faster
CPUs or reduce the SQL statement logical IO consumption (tune or balance). While
additional CPUs may provide more CPU capacity, Oracle and the operating system are not
able to fully take advantage of the existing four cores (for details, see the scalability
discussion near the end of this chapter).

Application Analysis
The application situation has indeed changed, as shown in Figure 9-29. First, we can see that
no significant physical IO is being consumed! This means increasing the buffer cache had its
intended effect. We were hoping for a 50% decrease in elapsed time, to around 0.316
sec/exec. What actually occurred was an elapsed time drop from 0.632 to 0.266 sec/exec, which is a
58% decrease in response time! So, we met and exceeded our objective. It appears the users
are also able to get more work done because the SQL statement execution rate increased from
25.6 exec/sec (see Figure 9-24) to 27.1 exec/sec (Figure 9-30).


Figure 9-29. Shown is the essential application SQL information. Notice there is no physical
IO consumed. Compared to Figure 9-23, the top SQL statement’s elapsed time per execution
improved from 0.632 sec/exec to 0.266 sec/exec, while at the same time, the number of
executions during the sample interval increased from 473 to 536.

Figure 9-30 shows logical IO response time decreased to 0.009354 ms/lio from 0.02199
ms/lio (Figure 9-24). Clearly, there was a significant service time change. This means initially
Oracle was burning CPU cycles on other tasks besides accessing buffers that already resided
in the cache. This is another example of the overhead involved in bringing buffers into
Oracle’s cache and updating all the related memory structures. As a result of the service time
drop, the response-time curve will shift down and to the right, as shown generally in Figure 9-17
and especially in Figure 9-31. This explains why SQL statement elapsed time decreased
and utilization decreased, while the workload increased.

Figure 9-30. Shown is the workload diagnostic information. Compared to Figure 9-24,
logical IO response time dropped from 0.02119 ms/lio down to a staggering 0.00935 ms/lio.
In addition, the overall logical IO workload increased from 69.29 lio/ms to 74.66 lio/ms,
representing an 8% increase. So again, performance has improved while the workload has
also increased.

Figure 9-31 shows the initial and current (buffer cache increase) response-time curves
using logical IO as the workload metric. The variables used to create the response-time curve
are four CPU cores (M=4); service times (St) of 0.01507 ms/lio and 0.009326 ms/lio for the
initial and increased buffer cache situation, respectively; and their various arrival rates of 69.3
lio/ms and 74.7 lio/ms, as indicated on the graph as points A and B, respectively. Because
Oracle now consumes less CPU per logical IO, the service time for logical IO decreased. As
shown graphically in Figure 9-31, the performance situation changed from point A to point B,
allowing both improved SQL statement elapsed time combined with an increase in workload
and a reduction in CPU utilization.
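The two curves in Figure 9-31 can be sketched with standard M/M/m (Erlang C) queuing mathematics. This is a sketch of that standard math, not OraPub's actual response-time graph template, but the parameters come straight from the text: four cores, the two LIO service times, and the two arrival rates. Reassuringly, the computed per-server utilizations come out to about 26% and 17%, the same figures quoted in the Oracle and operating system analyses.

```python
from math import factorial

def erlang_c(m, a):
    """Erlang C: probability an arrival must queue in an M/M/m system,
    given m servers and an offered load of a erlangs (a = lambda * St)."""
    rho = a / m
    top = (a ** m / factorial(m)) / (1 - rho)
    bottom = sum(a ** k / factorial(k) for k in range(m)) + top
    return top / bottom

def lio_response_time(arrival_rate, service_time, m=4):
    """Return (per-server utilization, response time) for an M/M/m queue.
    Times in ms, arrival rate in lio/ms."""
    a = arrival_rate * service_time
    rho = a / m
    queue_time = erlang_c(m, a) * service_time / (m * (1 - rho))
    return rho, service_time + queue_time

# Point A (initial) and point B (after the buffer cache increase):
# St = 0.01507 and 0.009326 ms/lio, arrival rates 69.3 and 74.7 lio/ms.
rho_a, r_a = lio_response_time(69.3, 0.01507)
rho_b, r_b = lio_response_time(74.7, 0.009326)
print(f"A: {rho_a:.0%} busy, R = {r_a:.5f} ms/lio")
print(f"B: {rho_b:.0%} busy, R = {r_b:.5f} ms/lio")
```

Sweeping the arrival rate while holding the service time fixed traces out an entire curve, which is essentially what the response-time graph template does.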

Figure 9-31. Shown is the response-time curve shift as a result of the logical IO service time
decrease (improvement). Not only does this increase performance with no workload change
(69 lio/ms), but in the current situation (point B), the response time remains improved along
with an 8% workload increase (74 lio/ms).

What-If Analysis
Now let’s suppose the users would like even more of a performance improvement. Based on
the OraPub 3-circle analysis, one obvious place to squeeze more performance out of the
system is a reduction in logical IO, which will reduce the total CPU consumption. This can be
accomplished by reducing the total number of LIOs. Figure 9-31 shows that to reduce LIO
response time, the LIO service time must be reduced. This means we must reduce the CPU
consumption per LIO. The most direct way to accomplish this is to influence the optimizer to
choose a more efficient execution plan, thereby reducing the CPU consumed per LIO.
Typically we also receive the added benefit of reducing the number of total LIOs processed.
As shown in Figure 9-29, the statement with a SQL_ID ending in d6w consumed nearly
22.7 million logical IOs during the 30-minute reporting period, which is about 42.3 thousand
logical IOs during each of its 536 executions. By tracing the d6w SQL statement, it was
confirmed that a typical execution touches around 42.3 thousand logical buffers. It was also
obvious that the large customer table was being full-table scanned! By simply creating an
index on the status column and rerunning the query, only three logical buffers were
touched. (While indexing a status column usually will not produce an improvement like
this, in this application, it was indeed the case.) This means even if the statement is run 536
times, only 1,608 buffers will be touched. And since each logical IO consumes around
0.00933 milliseconds of CPU (LIO service time), during the 30-minute interval, the statement
should consume only 15.003 milliseconds of CPU (1,608 × 0.00933). Keep in mind the 0.00933
ms figure is the average CPU consumption per LIO over the entire sample interval.


But the impact is more far-reaching, because creating an index on the status column
also impacts three other statements out of the top five logical IO statements. The other three
statements also touch only three logical IOs per execution. Of the five top logical IO
statements, only the statement with the SQL_ID ending in ggt is not improved by the new
index. As you’ll see later, the lack of a thorough index impact analysis will have unintended
consequences.
Table 9-3 details one way to calculate the CPU consumption change for multiple SQL
statements. By creating the status column index, each of the queries will consume only
three logical IOs per execution. Based on their number of executions during the 30-minute
sample interval, the expected logical IOs are calculated. Since each logical IO consumes
around 0.00933 ms of CPU, the expected total CPU consumption per tuned SQL statement is
calculated. When combined, the tuned statements will now consume only 0.0231 second of
CPU, compared to the initial 230.553 seconds.

Table 9-3. Determining CPU savings when using a status column index

SQL_ID           Current Total CPU (ms)   Total Execs   Expected LIOs   Expected Total CPU (ms)
fg8cnnjrf2d6w    142,797                  536           1,608           15.0
cyfcvf5k75npm    65,528                   243           729             6.8
ggaj1gzcj3gxp    10,872                   23            69              0.6
680t2uhr9tqqb    11,356                   24            72              0.7
Total            230,553                                                23.1
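Table 9-3 can be reproduced in a few lines; the three-LIOs-per-execution figure and the 0.00933 ms/LIO service time both come from the text above.

```python
LIOS_PER_EXEC = 3          # expected logical IOs per execution with the new index
CPU_MS_PER_LIO = 0.00933   # interval's average CPU consumed per logical IO (ms)

statements = {             # SQL_ID: (current total CPU ms, executions)
    "fg8cnnjrf2d6w": (142_797, 536),
    "cyfcvf5k75npm": (65_528, 243),
    "ggaj1gzcj3gxp": (10_872, 23),
    "680t2uhr9tqqb": (11_356, 24),
}

current_total = sum(cpu for cpu, _ in statements.values())
expected_total = sum(execs * LIOS_PER_EXEC * CPU_MS_PER_LIO
                     for _, execs in statements.values())
print(f"current: {current_total:,} ms, expected: {expected_total:.1f} ms")
```

This reproduces the table's totals: 230,553 ms of CPU today versus an expected 23.1 ms once the index is in place.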

While the improvement seems dramatic, only users who trigger multiple, serially
executing SQL statements are likely to feel any difference. Additionally, total sample
interval Oracle service time is 1,254 seconds (Figure 9-27), so a decrease of around 230
seconds may not result in much of a utilization improvement. But let’s do the math, create the
index, and see what happens.
Subtracting the CPU consumption from the statements shown in Table 9-3 (230,553
ms), and then adding back their tuned CPU consumption of 23.1 milliseconds, the expected
Oracle CPU consumption becomes 1,026,948.1 milliseconds (1,257,478 – 230,553 + 23.1),
which is 1,026.948 seconds. Placing the expected CPU consumption into the standard
utilization formula, we see the expected utilization is about 14%.

U = R / C = 1,026.948 s / (4 cores × 30 min × 60 s/min) = 0.143 = 14%
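The same calculation can be scripted. A minimal Python sketch, using the figures from the text (variable names are mine):

```python
# All times in seconds.
total_cpu_before = 1257.478      # total Oracle CPU over the 30-minute interval
tuned_stmt_cpu_before = 230.553  # CPU of the four targeted statements
tuned_stmt_cpu_after = 0.0231    # their expected CPU after the index

expected_cpu = total_cpu_before - tuned_stmt_cpu_before + tuned_stmt_cpu_after
capacity = 4 * 30 * 60           # 4 cores * 30 minutes * 60 s/min

utilization = expected_cpu / capacity
print(f"expected CPU: {expected_cpu:.3f} s, utilization: {utilization:.1%}")
```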
As shown in Figure 9-28, Oracle consumed 17% of the available CPU. By creating the
status index, we expect Oracle to consume about 14% of the available CPU. So what
seemed like a massive improvement that would certainly change the performance situation, is
actually expected to result in only a 3% CPU utilization savings. To see what actually
occurred when the index was added, read on!


Full Performance Analysis: Cycle 3


As a result of the first analysis cycle, it was decided to increase the Oracle buffer cache. The
result of that performance-enhancing change was reflected in the second analysis. In the
second analysis, we decided to further increase performance by creating an index on the
customer table column status. We are anticipating around a 3% CPU utilization
decrease and additional room for workload growth. While the top logical IO SQL statements
should have their elapsed times decreased to about zero, only users executing multiple serial
SQL statements are likely to feel any difference. As before, we are not sure how the users will
affect the workload. But as the dotted response-time curve in Figure 9-31 shows, even
doubling the workload should not significantly degrade response time. Here is what actually
happened.

Figure 9-32. Shown is the 30-minute interval ORTA as a result of an increase in the buffer
cache and adding the status column index. Compared to Figure 9-27, as expected, total
queue time is about the same and is insignificant. Far surpassing our expectations, total
service time decreased from 1,254 to 344 seconds! Clearly, the status column index
touched far more SQL statements than our top SQL report showed.

Oracle Analysis
Figure 9-32 shows a rather dramatic decrease in CPU consumption over the 30-minute sample
interval. When adding the status column index, we expected the total Oracle CPU
consumption to drop to around 1,027 seconds. However, it dropped to 343 seconds! So,
obviously the index had a much broader (and positive) impact than we anticipated. Based on
the ORTA, further performance improvements should once again focus on reducing CPU
consumption.


Operating System Analysis


Figure 9-33 shows the operating system is looking even better than before! The CPU
utilization dropped from 20% to 7%, and the IO subsystem is receiving virtually no IO
requests from Oracle. The status index creation reduced Oracle CPU consumption far
more than our anticipated 3%. The index impact was so prolific that it resulted in a 13% CPU
consumption reduction. As with the prior tuning cycle, from an operating system perspective,
using faster CPUs will decrease CPU-related response time.

Figure 9-33. Shown is the operating system analysis information. Because the Oracle load is
almost entirely CPU-based, targeting heavy logical IO SQL statements by creating the
status column index reduced CPU utilization to 7%. Oracle is submitting virtually no IO
requests to the operating system.

Application Analysis
The application situation has profoundly changed. Oracle is now processing fewer logical IOs
while at the same time executing more SQL statements. This means users are getting more
work done but consuming fewer resources! The addition of the status column index had a
much larger and positive impact than we anticipated. Clearly, there were other SQL
statements that benefited from the index creation. Over the 30-minute interval, we anticipated
Oracle CPU consumption would decrease from 1,257 seconds down to 1,027 seconds, but in
reality, the consumption decreased to a staggering 344 seconds.
Comparing the top SQL statements in Figure 9-29 (before index addition) with Figure 9-34,
notice there is now a new top logical IO-consuming SQL statement along with the
previous number three statement (SQL_ID ending in ggt). If performance is to be further
improved, we have once again clearly identified (and supported by an OraPub 3-circle
ORTA-focused approach) the next two SQL statements to address.


Figure 9-34. Shown is the essential application SQL information. Because of the index
addition, the targeted high logical IO SQL statements no longer appear on the top SQL
report! In addition, the top logical IO SQL statements now consume 8.4M and 4.1M logical
IOs, compared to the earlier case where the top two statements consumed 22.6M and 10.3M
logical IOs, respectively (Figure 9-29). The status column index has had a profound
impact on the most resource-consuming SQL.

It is also very encouraging that as a result of the additional index, the top two logical IO
statements consumed a combined 12.5M logical IOs (Figure 9-34), whereas before the index
addition, the top two consumed 32.9M logical IOs (Figure 9-29). So, by aligning our ORTA
with the application analysis, we correctly targeted the high-impact SQL statements.
The beauty of this is that the drop in logical IO consumption occurred in conjunction with an
increase in the number of SQL statement executions. For example, Figure 9-30 shows Oracle
processed 134M logical IOs and 48.7K SQL statement executions. But with the additional
index, Figure 9-35 shows Oracle processed only 37M logical IOs while executing over 53.1K
SQL statements! And this all occurred with a reduced CPU utilization.

Figure 9-35. Shown is the workload diagnostic information. Compared to Figure 9-30,
logical IO response time was maintained and the logical IO processing rate dropped from
74.7 lio/ms to 20.5 lio/ms, while the SQL execution rate increased from 27.1 exec/sec up to
29.5 exec/sec!

Full Performance Analysis: Summary


If you have been reading this book sequentially, my hope is you easily followed the preceding
performance analysis cycles. In a way, it is a broad review of the key aspects of this book.
Each cycle involved conducting an OraPub 3-circle analysis, understanding Oracle internals,
performing an ORTA, and anticipating the solution’s impact. The following are the key
points:


• Spot-on diagnosis resulted in very specific and targeted changes. There should be
no question about how we arrived at our recommendations.
• Each recommendation was accompanied with an anticipatory impact shown both
graphically and numerically. Even without doing any predictive mathematics, the
anticipated performance situation change was clearly evident when presenting the
response-time curves, further building consensus around the recommendations.
• The systematic analysis naturally created an easy-to-follow and convincing story
containing plenty of abstraction for the nontechnical folks, as well as specific Oracle
internals and even some mathematics to satisfy the more technically astute.
• There was no finger-pointing. Each subsystem was investigated and possible
performance-enhancing changes discussed. The implemented recommendations were
objectively selected based on the anticipated impact and ease of implementation.
However, there was no discussion about application usage disruption, uptime
requirements, and availability requirements. In real life, these issues almost always
take precedence over performance.
Table 9-4 shows a rather dramatic and satisfying flow of performance-improvement
metrics. There are a couple items worth highlighting. First, while the physical or logical IO
workload dropped, the number of SQL statement executions increased. While not shown in
this table, the key SQL statements had a continual elapsed time improvement. The reduction
in SQL statement resource consumption occurred in conjunction with a decrease in CPU
utilization and total Oracle response time. This is exactly the kind of result we want to see.

Table 9-4. Full performance analysis key metric change

Cycle                   PIO/ms   LIO/ms   Exec/sec   Oracle ST   Oracle QT   CPU     IOPS
                                                     (sec)       (sec)       Util.   R+W
Baseline                 67.78    69.29      25.64        1880         765     28%   4762.9
Buffer cache increase     0.00    74.66      27.06        1254           4     20%      0.7
Index addition            0.00    20.52      29.50         344           4      7%      0.3

While Table 9-4 is a numeric representation of our analysis flow (and success), Figure 9-36
is a graphical representation. Based on logical IOs, Figure 9-36 shows the initial and final
response-time curves and the respective arrival rates. Technical and nontechnical people alike
should be able to easily grasp that the situation is much better now at point B than when we
started at point A. Adding that there is now more room for growth and that the users are also
performing more real work (SQL statement execution) will add a final punch to our
presentation.


Figure 9-36. Shown is a logical IO-focused response-time curve highlighting and contrasting
the initial performance situation (point A) to the final performance situation (point B). This
response-time curve indicates a very successful performance effort because fewer resources
are required for a single logical IO (service time decreased), users are putting less of a load
on the system (while, though not shown here, their work productivity has increased), and the
database server's CPU subsystem can now accommodate much larger future growth.

Improper Index Impact Analysis Performed


When adding the status column index, we anticipated only a 3% decrease in CPU
utilization, but in reality, there was a 13% drop! Always try to be conservative, but in this
case, the anticipated performance impact was simply wrong. We got lucky because many
other SQL statements were impacted (for the better) in addition to the four we targeted.
Because we did not analyze all possible affected SQL statements, there could have easily been
other statements negatively impacted, eliminating any performance gain achieved from our
targeted efforts.
I could have simply left the index addition section out of the book, but I included it for
two reasons. First, to provide another example of how performance change can be anticipated.
Second, so you can observe how easy it is to be wrong by not thinking through a change.

Proper Use of Work and Time


Each cycle of this performance analysis used data from a 30-minute sample. While different
sample durations could have been used, by using the same sample interval, direct numeric
comparisons without a unit of time are possible. For example, I mentioned when the status
column index was added, the number of SQL executions increased from 48.7K to 53.1K. If
the first sample interval was 30 minutes yet the second interval was 60 minutes, we could not
have responsibly made this direct comparison. Instead of stating there was a total of 53.1K
statement executions, we could have stated the statement execution rate was 29.5 exec/sec.
Notice in Table 9-4, for all work-related metrics a unit of time was included. By
providing a unit of work and time, direct comparisons from the past or with other systems can


be made, regardless of the sample interval. It’s OK to use interval totals without reference to
time, as long as the sample intervals are the same.
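Converting interval totals into rates is trivial arithmetic, but doing it consistently is what makes cross-interval comparisons responsible. A tiny Python sketch (the function name is illustrative; the figures come from the text):

```python
def exec_rate(total_execs, interval_seconds):
    """Convert an interval total into a per-second rate."""
    return total_execs / interval_seconds

# 53.1K executions over a 30-minute sample
rate = exec_rate(53_100, 30 * 60)
print(f"{rate:.1f} exec/sec")  # 29.5 exec/sec
```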

Batch Process-Focused Performance Analysis


Many Oracle systems contain a mix of online and batch processing, or are perhaps more
online-focused during the day and batch-focused during the night. Obviously, improving batch
processing performance is just as important as improving online processing.
There is a significant difference or shift in focus when working on batch processes. Our
concern shifts from response time per unit of work to total response time. In other words, we
are more concerned about how long it takes to process 500MB of IO or 5,000 seconds of CPU
compared to the response time of a single physical IO. Another way of looking at this is our
unit of work becomes the entire batch process or a step in the batch process.
While the response-time curve can be used when working with batch processes, because
of the longer and singular process time focus, it is not nearly as useful. The response-time
curve shines when it relates time to small units of work, like a logical IO or the execution of a
SQL statement. Because our focus has shifted from small units of work to an entire process or
process segment, our method of reflecting the situation must also change. Instead of using a
response-time curve, the situation can be conveyed numerically in a table format (see Table
9-5) or by using a simple timeline.

Setting Up the Analysis


Table 9-5 shows how to set up a batch process analysis. The entire batch process has been
segmented into three steps, or segments. Step determination is based on your objectives,
available statistics, and your audience. The time data comes from the same sources as with
online transactions, but, as you’ll see, with a slight yet significant twist. When the process
steps have been defined and the respective data collected, a table similar to Table 9-5 can be
constructed.

Table 9-5. Analyze batch process performance by response time per step

Step      Total Response   Total CPU    Total IO Read   Total IO Write   Total Other
          Time (sec)       Time (sec)   Time (sec)      Time (sec)       Time (sec)
Load               1,989           89             267            1,628             5
Process            2,106        1,706             239               29           132
Update               624           76             123              403            22
Total              4,719        1,871             629            2,060           159

In addition to helping focus the analysis, a setup similar to Table 9-5 naturally allows us
to calculate anticipated change with a greater degree of accuracy. For example, if we believe
through increased parallelism the Load step’s write time can be reduced by 50%, we can
easily adjust the table and recalculate the response time. So not only does this table help us

understand the situation, target our efforts, and communicate the situation to others, it also
aids in anticipating performance improvements.
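This kind of what-if adjustment is simple enough to script. The Python sketch below (data from Table 9-5; the structure and names are illustrative) halves the Load step's IO write time and recomputes the total response time:

```python
# Per-step times in seconds: (cpu, io_read, io_write, other), from Table 9-5.
steps = {
    "Load":    (89, 267, 1628, 5),
    "Process": (1706, 239, 29, 132),
    "Update":  (76, 123, 403, 22),
}

def response_time(cpu, rd, wr, other):
    # A step's response time is the sum of its time components.
    return cpu + rd + wr + other

before = sum(response_time(*t) for t in steps.values())

# What-if: increased parallelism halves the Load step's IO write time.
cpu, rd, wr, other = steps["Load"]
steps["Load"] = (cpu, rd, wr / 2, other)

after = sum(response_time(*t) for t in steps.values())
print(f"before: {before:.0f} s, after: {after:.0f} s")  # before: 4719 s, after: 3905 s
```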
To help understand the complexity of working on each step, additional columns can be
helpful. For example, the table could also include the number of top SQL statements and
some complexity metric. The point is that the table should encourage fruitful discussion and
analysis, so an informed decision can be made about where to target the performance effort.

Capturing Resource Consumption


There are two significant differences in capturing response time information when focused on
a batch process: client process time and background process time.
Capturing Client Process Time
In vivid contrast with online activity, during batch processing, there is no think time and there
can be significant client-processing time. As a result, for the total Oracle response time to
equal batch process elapsed time, our response-time analysis must include client-processing
time and also communication time between the client and server process. As presented in
Chapter 5, this time component is captured by the SQL*Net message from client
and SQL*Net more data from client wait events. If database links are involved,
then don’t forget to also include SQL*Net message from dblink and SQL*Net
more data from dblink. When this time, normally classified as idle and considered
useless, is included, the batch process elapsed time will equal the total Oracle response time.
Removing Background Process Time
While including background process service and queue time is important and useful when
analyzing an entire system, it can become less useful when focused on a single or a few
sessions or processes. This normally does not present a problem, because session-focused
collection will naturally include only time related to a specific session or group of sessions.
So, what may seem like quite a technical challenge turns out to be not that big of a deal.
Depending on the Oracle release, Statspack and AWR reports may separate background
process time. This makes our job even easier. But if the time is not separated, simply
manually exclude any background process-related time.
When gathering data using your own scripts, remember to make the appropriate
adjustments. The significant time-consuming background process wait events should be of no
surprise: log file parallel write and db file parallel write. There are
others, of course, but these are the most common. The other wait events are easily
recognizable by their event name being associated with a background process. If you are
unsure, refer to Oracle documentation or search Metalink. Even better, sample from
v$session_wait or, for Oracle Database 10g and later, sample from
v$active_session_history or v$session, to see which sessions are posting the
wait event in question.

Including Parallelism in the Analysis


Back in Chapter 3, I stated that serialization is death. When working with batch processes,
this is profoundly important. Suppose that a process takes 60 minutes to complete, and the
system has ample available capacity. We know that if we can alter the process to run in two
streams instead of one, the process may complete twice as fast—in 30 minutes. That is using
parallelism to our advantage.


With online processes, Oracle has already taken significant steps to increase and take
advantage of parallelism. The existence of multiple server and background processes is an
example of this. However, having a batch Oracle client process related to a single Oracle
server process can become a serialization limitation. So, our parallelism effort will focus on
ways to split the process into multiple streams, each with its own Oracle client and server
process.
Anticipating Elapsed Time
When a process is serialized, there may be plenty of available capacity, but it cannot be used.
For example, if there are four CPU cores providing 240 seconds of CPU power over a
1-minute period (4 × 1 × 60), but a single stream process is serialized, it can only hope to
consume at most 60 seconds of CPU. If we look at the operating system during the serial
process, average CPU utilization will be 25%, while our CPU-intensive batch process crawls
along. What is needed is increased parallelism to take advantage of the additional and
available resources.
We can mathematically determine batch process segment elapsed time by simply
dividing the required resource time by the available parallelism. For example, suppose a CPU-
intensive batch process segment consumes 120 seconds of CPU. When run serially, this
process takes 120 seconds. After some analysis, it was determined the process could be split
into three parallel streams without any Oracle concurrency issues. The anticipated elapsed
time becomes 40 seconds. The formula is as follows:

E = R / P = 120 s / 3 = 40 s
where:
• E is the elapsed time.
• R is the resources required or duration.
• P is the used and available parallel streams.
For this example, 120 seconds of CPU is required and three parallel streams are
available, so the anticipated elapsed time is 40 seconds. If we looked at the average CPU
utilization, it would now be around 75% busy, because only three of the four available CPU
cores are being used.
The used and available parallelism are obviously very important. Just because four CPU
cores or 100 IO devices exist does not mean they can be used. In our example, there are four
CPU cores available, but the application developers were able to create only three parallel
streams. And, of course, if the application creates ten parallel streams, yet only four cores are
available, the CPU subsystem will become a bottleneck, operating deep in the elbow of the
response-time curve.
Scalability Issues
Expecting a doubling of parallelism to yield a twofold performance improvement is a best-case
scenario, and an unlikely one. There are a number of reasons why parallelism can be
limited:
• There must be available resources.


• Oracle concurrency issues, such as enqueues, buffers being busy, and latching, can arise.
• Processes that are split typically must have their results merged, which may force the
creation of an additional process or, if the merge process already exists, it may take
more time to complete.
• There is classic operating system-related scalability.
In reality, with every additional parallel resource (for example, CPU core), a fraction of
the power effectively becomes unavailable or lost. As mentioned, if a batch process is split,
there may be the need to merge the results. The merge process is the direct result of the
increased parallelism, and this constitutes a piece of perceived processing gain we effectively
lose. It’s true that overall we can reduce elapsed time, but the scalability effect is real, and it
grows as the number of parallel streams increases.
There are a number of ways to determine the scalability effect. The simplest way is to
run tests and get to know your application. If that is not practical, then be conservative. There
are also a number of ways we can numerically represent the scalability effect. For example,
let’s suppose with every additional parallel stream, 10% is lost to scalability. This results in a
more realistic elapsed time expectation. There are many methods of accounting for scalability.
For this example, I chose to use a simple yet robust geometric scaling model. The elapsed
time formula now becomes as follows:

E = R / (Φ^0 + Φ^1 + … + Φ^(P-1))
where:
• E is the elapsed time.
• R is the resources required.
• P is the used and available parallelism.
• Φ is the parallel factor: 1 is complete parallelism, and 0 is no parallelism.
Applying scalability to our example of splitting the batch process into three streams and
an optimistic parallelization factor of 98%, the elapsed time is calculated as follows:

E = R / (Φ^0 + Φ^1 + Φ^2) = 120 s / (0.98^0 + 0.98^1 + 0.98^2) = 120 s / 2.94 = 41 s
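The geometric scaling model can be sketched in a few lines of Python (the function name is illustrative; the figures come from the text):

```python
def elapsed_time(resource_seconds, streams, phi=1.0):
    """Elapsed time for a batch step split across parallel streams.

    phi is the parallel factor: 1.0 means complete parallelism, and each
    additional stream contributes phi**k of a full stream's power.
    """
    effective_streams = sum(phi ** k for k in range(streams))
    return resource_seconds / effective_streams

print(elapsed_time(120, 3))                   # complete parallelism: 40.0 s
print(round(elapsed_time(120, 3, phi=0.98)))  # 2% lost per stream: 41 s
```

Note that phi = 0 collapses the sum to a single effective stream, which matches the formula's definition of no parallelism.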
I realize that most DBAs will not be so scientific in their work, but please do not forget
to include some overhead when anticipating a parallelization increase. Forgetting or ignoring
scalability will produce overly optimistic predictions.13

13. For an in-depth discussion of scalability, refer to Forecasting Oracle Performance (Apress, 2007), Chapter 10, "Scalability".


Operating in the Elbow of the Curve


In Chapter 4, I noted that, in certain situations, it is desirable to encourage a system to be
operating in the elbow of the response-time curve. That discussion should make a whole lot
more sense now, and I’ll summarize it again.
When focused on online user response time, a snappy response is desired. To increase
the likelihood of snappy response time, we want the likelihood of queuing to be very low. We
encourage this by ensuring low utilization, keeping the arrival rate low enough to prevent
the system from operating in the elbow of the curve. While this produces snappy online
response time, it also leaves available computing resources on the table. With a batch process
focus, we want to use those leftover and available computing resources. In fact, to leave any
available resource unused can be considered wasteful, shows parallelism is limited, and could
result in a longer elapsed time. So, with batch processing, we look for ways to use any and all
computing resources available to minimize elapsed time. This means the system will be
operating in the elbow of the curve.
From a CPU perspective, this means the average run queue will always be equal to or
greater than the number of CPU cores. But there is a limit to our aggressive resource usage. If
we allow the system to push too far into the elbow, the overhead associated with managing
the system increases to the point where the elapsed time begins to degrade. Our job as
performance analysts is to find the sweet spot and operate the batch processing there!

Summary and Next Steps


This chapter is truly about performance analysis. Some would say it is enough to find a
problem and make some changes. I respectfully disagree. I believe we can do so much more,
be so much more effective, and add greater value to our organization. Nearly any Oracle
analysis will result in a number of recommendations. But in order to responsibly decide which
changes to implement first, some ranking must occur. This chapter focused on bringing
rational debate and consensus to ranking the performance-enhancing solutions. I hope you
have found it enlightening and practical.
This final chapter brings a natural and fitting finality to Oracle Performance
Firefighting. We started with method, moved into diagnosis and data collection, then into
relevant Oracle internals, enabling us to intelligently choose valid performance-enhancing
solutions, and then finally anticipate the impact of our proposed solutions. It’s part of being an
effective Oracle performance firefighter.
I truly wish this book had been available in 1989 when I first joined Oracle’s consulting
division. It would have made such a difference! So to those of you who are relatively new to
Oracle optimization, here it is: enjoy! And to those of you who have been slugging along for
years optimizing Oracle systems, I hope you have a renewed enthusiasm and expectation for
your work.
Thank you for taking the time to read my book. I look forward to hearing from you!


