Perform Without Waiting
By Arup Nanda
OTN
Diagnose performance problems, using the wait interface in Oracle 10g.
John, the DBA at Acme Bank, is on the phone with an irate user, Bill, who complains that his database session is hanging, a complaint not unfamiliar to most DBAs. What can John do to address Bill's complaint?
Acme Bank's database is Oracle Database 10g, so John has many options. Automatic Database Diagnostic Manager (ADDM), new in Oracle Database 10g, can tell John about the current overall status and performance of the database, so John starts with ADDM to determine whether what Bill's session is experiencing is the result of a databasewide issue. The ADDM report identifies no databasewide issues that could have this impact on Bill's session, so John moves on to the next option.
One way to diagnose session-level events such as Bill's is to determine whether the session is waiting for anything, such as the reading of a block of a file, a lock on a table row, or a latch. Oracle has provided mechanisms to display the waits happening inside the database since Oracle7, and during the last several years, the model has been steadily perfected, with more and more diagnostic information added to it. In Oracle Database 10g, which makes significantly improved wait event information available, diagnosing a session slowdown has become even easier. This article shows you how to use the wait events in Oracle Database 10g to identify bottlenecks.
Session Waits
How can John the DBA determine what's causing Bill's session to hang? Actually, the session is not hanging; it's waiting for an event to happen, and that's exactly what John checks for.
To continue his investigation, John could use Oracle Enterprise Manager or he could directly access V$ views from the command line. John has a set of scripts he uses to diagnose these types of problems, so he uses the command line.
John queries the V$SESSION view to see what Bill's session is waiting for. (Note that John filters out all idle events.)
select sid, username, event, blocking_session,
seconds_in_wait, wait_time
from v$session where state in ('WAITING')
and wait_class != 'Idle';
The output follows, in vertical format.
SID : 270
USERNAME : BILL
EVENT : enq: TX - row lock contention
BLOCKING_SESSION : 254
SECONDS_IN_WAIT : 83
WAIT_TIME : 0
Looking at this information, John immediately concludes that Bill's session with SID 270 is waiting for a lock on a table and that that lock is held by session 254 (BLOCKING_SESSION).
But John wants to know which SQL statement is causing this lock. He can find out easily, by issuing the following query joining the V$SESSION and V$SQL views:
select sid, sql_text
from v$session s, v$sql q
where sid in (254,270)
and (q.sql_id = s.sql_id or
q.sql_id = s.prev_sql_id);
Listing 1 shows the result of the query. And there (in Listing 1) John sees it—both sessions are trying to update the same row. Unless session 254 commits or rolls back, session 270 will continue to wait for the lock. He explains this to Bill, who, considerably less irate now, decides that something in the application has gone awry and asks John to kill session 254 and release the locks.
Code Listing 1: V$SESSION and V$SQL query finds lock
SID SQL_TEXT
--- ---------------------------------------------------------------------
270 update accounts set balance = balance - 750 where account_no = 333
254 update accounts set balance = balance - 1000 where account_no = 333
Wait Classes
After John kills the blocking session, Bill's session continues but is very slow. John decides to check for other problems in the session. Again, he checks for any other wait events, but this time he specifically checks Bill's session.
In Oracle Database 10g, wait events are divided into various wait classes, based on their type. The grouping of events lets you focus on specific classes and exclude nonessential ones such as idle events. John issues the following against the V$SESSION_WAIT_CLASS view:
select wait_class_id, wait_class,
total_waits, time_waited
from v$session_wait_class
where sid = 270;
The output, shown in Listing 2, shows the wait classes and how many times the session has waited for events in each class. It tells John that application-related waits such as those due to row locks have occurred 17,760 times, for a total of 281,654 centiseconds (cs)—hundredths of a second—since the instance started. John thinks that this TIME_WAITED value is high for this session. He decides to explore the cause of these waits in the application wait class. The times for individual waits are available in the V$SYSTEM_EVENT view. He issues the following query to identify individual waits in the application wait class (class id 4217450380):
Code Listing 2: Waits summarized by wait classes
WAIT_CLASS_ID WAIT_CLASS TOTAL_WAITS TIME_WAITED
------------- --------------- ----------- -----------
1893977003 Other 16331 44107
4217450380 Application 17760 281654
3290255840 Configuration 834 2794
3875070507 Concurrency 1599 96981
3386400367 Commit 13865 4616
2723168908 Idle 513024 103732677
2000153315 Network 254534 379
1740759767 User I/O 32709 53182
4108307767 System I/O 103019 9921
select event, total_waits, time_waited
from v$system_event e, v$event_name n
where n.event_id = e.event_id
and wait_class_id = 4217450380;
Listing 3 shows the output of this query. It shows that lock contentions (indicated by the event enq: TX - row lock contention) constitute the major part of the waiting time in the application wait class. This concerns John. Is it possible that a badly written application made its way through to the production database, causing these lock contention problems?
Code Listing 3: Waits in a specific wait class—"application"
EVENT TOTAL_WAITS TIME_WAITED
------------------------------ ----------- -----------
enq: RO - fast object reuse 5 24
enq: TX - row lock contention 2275 280856
SQL*Net break/reset to client 15696 822
Being the experienced DBA that he is, however, John does not immediately draw that conclusion. The data in Listing 3 merely indicates that the users have experienced lock-contention-related waits a total of 2,275 times, for 280,856 cs. It is possible that mostly 1- or 2-cs waits and only one large wait account for the total wait time, and in that case, the application isn't faulty. A single large wait may be some freak occurrence skewing the data and not representative of the workload on the system. How can John determine whether a single wait is skewing the data?
Oracle 10g provides a new view, V$EVENT_HISTOGRAM, that shows the wait time periods and how often sessions have waited for a specific time period. He issues the following against V$EVENT_HISTOGRAM:
select wait_time_milli bucket, wait_count
from v$event_histogram
where event =
'enq: TX - row lock contention';
The output looks like this:
BUCKET WAIT_COUNT
----------- ----------
1 252
2 0
4 0
8 0
16 1
32 0
64 4
128 52
256 706
512 392
1024 18
2048 7
4096 843
The V$EVENT_HISTOGRAM view shows the buckets of wait times and how many times the sessions waited for a particular event—in this case, a row lock contention—for that duration. For example, sessions waited 252 times for less than 1 millisecond (ms), once less than 16 ms but more than 1 ms, and so on. The sum of the values of the WAIT_COUNT column is 2,275, the same as the value shown in the event enq: TX - row lock contention, shown in Listing 3. The V$EVENT_HISTOGRAM view shows that the most waits occurred in the ranges of 256 ms, 512 ms, and 4,096 ms, which is sufficient evidence that the applications are experiencing locking issues and that this locking is the cause of the slowness in Bill's session. Had the view showed numerous waits in the 1-ms range, John wouldn't have been as concerned, because the waits would have seemed normal.
Time Models
Just after John explains his preliminary findings to Bill, Lora walks in with a similar complaint: Her session SID 355 is very slow. Once again, John looks for the events the session is waiting for, by issuing the following query against the V$SESSION view:
select event, seconds_in_wait,
wait_time
from v$session
where sid = 355;
The output, shown in Listing 4, shows a variety of wait events in Lora's session, including latch contention, which may be indicative of an application design problem. But before he sends Lora off with a prescription for an application change, John must support his theory that bad application design is the cause of the poor performance in Lora's session. To test this theory, he decides to determine whether the resource utilization of Lora's session is extraordinarily high and whether it slows not only itself but other sessions too.
Code Listing 4: Waits experienced by a specific session
EVENT SECONDS_IN_WAIT WAIT_TIME
------------------------------- --------------- ---------
latch: cache buffers lru chain 172 0
latch: checkpoint queue latch 2 0
latch: cache buffers chains 607 0
buffer busy waits 500 0
db file sequential read 30247 0
db file scattered read 887 0
In the Time Model interface of Oracle Database 10g, John can easily view details of time spent by a session in various activities. He issues the following against the V$SESS_TIME_MODEL view:
select stat_name, value
from v$sess_time_model
where sid = 355;
The output, shown in Listing 5, displays the time (in microseconds) spent by the session in various places. From this output, John sees that the session spent 503,996,336 microseconds parsing (parse time elapsed), out of a total of 878,088,366 microseconds on all SQL execution (sql execute elapsed time), or 57 percent of the SQL execution time, which indicates that a cause of this slowness is high parsing. John gives Lora this information, and she follows up with the application design team.
Code Listing 5: Session time model
STAT_NAME VALUE
--------------------------------------------------- ----------
DB time 878239757
DB CPU 835688063
background elapsed time 0
background cpu time 0
sequence load elapsed time 0
parse time elapsed 503996336
hard parse elapsed time 360750582
sql execute elapsed time 878088366
connection management call elapsed time 7207
failed parse elapsed time 134516
failed parse (out of shared memory) elapsed time 0
hard parse (sharing criteria) elapsed time 0
hard parse (bind mismatch) elapsed time 0
PL/SQL execution elapsed time 6294618
inbound PL/SQL rpc elapsed time 0
PL/SQL compilation elapsed time 126221
Java execution elapsed time 0
OS Statistics
While going over users' performance problems, John also wants to rule out the possibility of the host system's being a bottleneck. Before Oracle 10g, he could use operating system (OS) utilities such as sar and vmstat and extrapolate the metrics to determine contention. In Oracle 10g, the metrics at the OS level are collected automatically in the database. To see potential host contention, John issues the following query against the V$OSSTAT view:
select * from v$osstat;
The output in Listing 6 shows the various elements of the OS-level metrics collected. All time elements are in cs. From the results in Listing 6, John sees that the single CPU of the system has been idle for 51,025,805 cs (IDLE_TICKS) and busy for 2,389,857 cs (BUSY_TICKS), indicating a CPU that is about 4 percent busy. From this he concludes that the CPU is not a bottleneck on this host. Note that if the host system had more than one CPU, the columns whose headings had the prefix AVG_, such as AVG_IDLE_TICKS, would show the average of these metrics over all the CPUs.
Code Listing 6: Output from V$OSSTAT view
STAT_NAME VALUE OSSTAT_ID
-------------------------- ----------- ----------
NUM_CPUS 1 0
IDLE_TICKS 51025805 1
BUSY_TICKS 2389857 2
USER_TICKS 1947618 3
SYS_TICKS 439736 4
NICE_TICKS 2503 6
AVG_IDLE_TICKS 51025805 7
AVG_BUSY_TICKS 2389857 8
AVG_USER_TICKS 1947618 9
AVG_SYS_TICKS 439736 10
AVG_NICE_TICKS 2503 12
RSRC_MGR_CPU_WAIT_TIME 0 14
IN_BYTES 16053940224 1000
OUT_BYTES 96638402560 1001
AVG_IN_BYTES 16053940224 1004
AVG_OUT_BYTES 96638402560 1005
Active Session History
So far the users have consulted John exactly when each problem occurred, enabling him to peek into the performance views in real time. This good fortune doesn't last long—Janice comes to John complaining about a recent performance problem. When John queries the V$SESSION view, the session is idle, with no events being waited for. How can John check which events Janice's session was waiting for when the problem occurred?
Oracle 10g collects the information on active sessions in a memory buffer every second. This buffer, called Active Session History (ASH), which can be viewed in the V$ACTIVE_SESSION_HISTORY dynamic performance view, holds data for about 30 minutes before being overwritten with new data in a circular fashion. John gets the SID and SERIAL# of Janice's session and issues this query against the V$ACTIVE_SESSION_HISTORY view to find out the wait events for which this session waited in the past.
select sample_time, event, wait_time
from v$active_session_history
where session_id = 271
and session_serial# = 5;
The output, excerpted in Listing 7, shows several important pieces of information. First it shows SAMPLE_TIME—the time stamp showing when the statistics were collected—which lets John tie the occurrence of the performance problems to the wait events. Using the data in the V$ACTIVE_SESSION_HISTORY view, John sees that at around 3:17 p.m., the session waited several times for the log buffer space event, indicating that there was some problem with redo log buffers. To further aid the diagnosis, John identifies the exact SQL statement executed by the session at that time, using the following query of the V$SQL view:
Code Listing 7: Output of active session history
SAMPLE_TIME EVENT WAIT_TIME
-------------------------- --------------------------- ---------
22-FEB-04 03.17.23.028 PM latch: library cache 14384
22-FEB-04 03.17.24.048 PM latch: library cache 14384
22-FEB-04 03.17.25.068 PM log file switch completion 17498
22-FEB-04 03.17.26.088 PM log file switch completion 17498
22-FEB-04 03.17.27.108 PM log buffer space 11834
22-FEB-04 03.17.28.128 PM log buffer space 11834
22-FEB-04 03.17.29.148 PM log buffer space 11834
22-FEB-04 03.17.30.168 PM log buffer space 11834
22-FEB-04 03.17.31.188 PM log buffer space 11834
select sql_text, application_wait_time
from v$sql
where sql_id in (
select sql_id
from v$active_session_history
where sample_time =
'22-FEB-04 03.17.31.188 PM'
and session_id = 271
and session_serial# = 5
);
The output is shown in Listing 8.
Code Listing 8: Text of the SQL from active session history
SQL_TEXT APPLICATION_WAIT_TIME
------------------------------------------------------- ---------------------
update accounts set balance = balance -750where950 account_no= 333
update accounts set balance = balance - 845 where 457 account_no = 451
update accounts set balance = balance - 434 where 1235 account_no = 239
The column APPLICATION_WAIT_TIME shows how long the sessions executing that SQL waited for the application wait class. In addition to the SQL_ID, the V$ACTIVE_SESSION_HISTORY view also lets John see specific rows being waited for (in case of lock contentions), client identifiers, and much more.
What if a user comes to John a little late, after the data is overwritten in this view? When purged from this dynamic performance view, the data is flushed to the Active Workload Repository (AWR), a disk-based repository. The purged ASH data can be seen in the DBA_HIST_ACTIVE_SESSION_HIST view, enabling John to see the wait events of a past session. The data in the AWR is purged by default after seven days.
Conclusion
Oracle Database 10g introduces a number of enhancements designed to automate and simplify the performance diagnostic process. Wait event information is more elaborate in Oracle Database 10g and provides deeper insight into the cause of problems, making the diagnosis of performance problems a breeze in most cases, especially in proactive performance tuning.
Tuesday, December 30, 2008
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment