Sunday, July 5, 2009

Databases: UK Oracle User Group "How To Read Statspack/AWR" and Jonathan Lewis


Jonathan Lewis - an inspiring teacher


Whilst there were many interesting ideas and people at the July 2009 DBA meeting in London, the undoubted highlight was Jonathan Lewis's presentation on "How To Read Statspack/AWR".

 

Jonathan is undoubtedly one of the most experienced Oracle database performance experts in the UK, but beyond this he is a great presenter. I first saw Jonathan five years ago (May 2004), again at a UK Oracle User Group meeting, giving a presentation on the 9i Cost Based Optimizer, collecting stats and some gotchas (e.g. after collecting stats your performance can get worse, why collecting detailed stats on key small tables can be critical, what sample size you should use for large tables holding the bulk of your data ...).

 

In May 2004 I had only been working as a DBA for 5 months, and whilst I understood the basics of SQL, indexes and PL/SQL, having worked as a J2EE/Oracle developer for the three years before, this turbo-charged presentation highlighting key features and gotchas of the 9i Cost Based Optimizer was always going to be tough material for a rookie DBA. However, like a true "inspiring teacher", Jonathan was able to make these complex ideas and structures accessible and exciting. 


My overview of the presentation


For this latest presentation I have written a detailed summary of what I learnt during Jonathan's expert presentation (as this is the best way to reinforce my learning and also to chart my own progress with Oracle performance monitoring).


Jonathan has given this current presentation on how to read Statspack (or AWR) reports several times recently, and some of the feedback he has received was that: yes, it was fantastic the way he could dissect all the information, pick out the key stats and then paint a correct picture of the system, where the bottlenecks were and what they should consider doing to resolve them. 


Jonathan always starts modestly, but as he investigates he builds up a picture of the physical system, its usage and, critically, where the bottlenecks and heavy load are, and often rapidly focuses on the key problems. He always has plenty of caveats: even when you have resolved your current bottleneck, there are no guarantees that you will not hit new problems which are lurking in the shadows of your current heavy load.


One piece of feedback which he has received a couple of times was that his presentation was almost "magical". While we all like to be admired, the point of his presentation was to instruct and train, not to sell Jonathan's consultancy skills (Jonathan is a member of the OakTable Network - a small elite group of top Oracle database experts). So for this presentation he focused on the core activities around analysing a Statspack report, although he did digress into discussing high "leaf node splits", a classic problem where an index includes a "last updated timestamp" column, which rapidly leads to highly fragmented indexes.


How I approach Statspack / AWR


When I am looking at an AWR report I tend to look at it in a rather one-dimensional way:


1) Identify the bottleneck


Typically I start at the Top 5 Wait Events, i.e. is the bottleneck CPU, I/O, row locks, buffer busy waits ...


I will check the Load Profile; particularly if the system appears to be I/O bound then I am especially interested in the buffer cache hit ratio. I have not seen many parsing problems within the SAP-based systems I have been looking at over the last year: this may be because the SAP code adheres to good Oracle development practices, i.e. always using bind variables to minimise the need for hard parsing; alternatively, as the systems I have been reviewing have large shared pools (1.5GB+), this may be why I regularly see soft parse rates of 99%+.
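As a rough sketch of the check I do here (the parse rates below are invented for illustration, not from a real report), the soft parse ratio falls straight out of the load profile's total and hard parse rates:

```python
# Sketch: derive the soft parse ratio from Statspack/AWR load profile figures.
# The example rates are illustrative, not from a real system.

def soft_parse_pct(parses_per_sec: float, hard_parses_per_sec: float) -> float:
    """Soft parse % = (total parses - hard parses) / total parses * 100."""
    if parses_per_sec == 0:
        return 100.0
    return (parses_per_sec - hard_parses_per_sec) / parses_per_sec * 100.0

# e.g. ~1,953 parses/sec with only 0.01 hard parses/sec
print(f"soft parse ratio: {soft_parse_pct(1953.0, 0.01):.3f}%")
```

With bind variables in use almost everywhere, the hard parse rate stays near zero and the ratio sits comfortably above 99%.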


If the problem appears to be data access related (buffer gets, physical reads or enqueues), then I will also look at the segment stats to identify which tables are the worst affected.


Once I have made an initial assessment of what sort of problems we have, we can look at the "top sql" sections to identify which statements are the worst offenders.


2) Check top sql


This shows the top SQL in several sections/ordered lists: Elapsed Time, CPU Time, Buffer Gets, Physical Reads, Executions and Parse Calls. Often the same SQL statements feature across multiple sections.


For example, suppose AWR is reporting "read by other session" as the top wait event. Looking at the "top sql", one sql_id stands out in terms of elapsed time: 10x higher than the 2nd item on the list! However this sql_id is less significant in terms of the other key metrics:


  • Reads: 1st place, 50% more than the 2nd item
  • CPU: 2nd place, only 40% of the top item
  • Gets: 4th place, only 10% compared with the top 3 entries
  • Executions: not on the "top sql by executions" list 



In this case, using the 10g tuning advisor reports, there is a new index which should reduce the I/O costs by a factor of 40.


3) Other sections I have been checking in AWR  


While the above gives a good indication of how I have been using AWR, in other circumstances I have used other stats:


investigating high "db file sequential reads" - top wait event


"Segments by Physical Reads" : 

    double-checking/confirming the results from the "top sql by [physical] reads"


"Segments by Logical Reads": 

    is a high proportion of the reads physical? 

    are certain key processes being squeezed out of the buffer cache? 

    do we need a larger buffer cache? 
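One way I think about the first of those questions can be sketched as follows (the segment names and figures are invented for illustration - real values come from the two segment-stats sections). As a rough heuristic, a segment whose reads are mostly physical is being squeezed out of the buffer cache:

```python
# Sketch: per-segment physical-to-logical read ratio (rough heuristic only -
# a high percentage suggests the segment's blocks rarely survive in cache).
# Segment names and figures are invented for illustration.

segments = {
    # segment: (logical reads, physical reads)
    "ORDERS":        (5_000_000, 4_000),
    "ORDER_HISTORY": (800_000, 600_000),
}

def physical_read_pct(logical_reads: int, physical_reads: int) -> float:
    """Percentage of a segment's reads that had to go to disk."""
    return physical_reads / logical_reads * 100 if logical_reads else 0.0

for name, (logical, physical) in segments.items():
    print(f"{name}: {physical_read_pct(logical, physical):.2f}% physical")
```

In this invented example ORDERS is well cached (under 0.1% physical) while ORDER_HISTORY is hitting disk for 75% of its reads, which would point me towards the Buffer Pool Advisory next.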


"Buffer pool Advisory":

    again looking for evidence of the effect of increasing the buffer cache


"File IO Stats" and "Table IO Stats":

    are we experiencing performance bottlenecks on certain tablespaces or files?



investigating high "row lock contention" - top wait event



"Segments by Row Lock Waits" : 

    Which segments have high row lock waits? 

    This can then be tied back to the Top SQL (particularly by elapsed time) to see the source sql_ids for the row locks ...


Comparing Statspack / AWR reports



One thing which Jonathan was not able to do within the structure of this presentation, but which is "standard practice", is to compare two reports: one from an interval when the system is struggling against a second when the system is performing OK. Ideally you want to compare two intervals which are performing the same sort of processing. It is difficult to compare online processing with batch; ideally, if you are struggling with, say, month-end processing, then you want a Statspack report from a previous month-end when performance was OK. (Unfortunately we do not routinely generate and store reports for when the system is running well; this makes month-end, quarter-end and year-end performance problems harder to analyse.)
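A minimal sketch of that "compare two intervals" idea (the stat names and figures are invented for illustration): given per-second Load Profile rates from a good interval and a struggling one, flag the stats that have moved significantly:

```python
# Sketch: flag Load Profile stats that differ markedly between two intervals.
# Figures are invented; real values come from two Statspack/AWR reports
# covering comparable workloads (e.g. two month-end runs).

def changed_stats(baseline: dict, problem: dict, threshold: float = 2.0) -> dict:
    """Return stats whose per-second rate grew by more than `threshold`x."""
    flagged = {}
    for stat, base_val in baseline.items():
        cur_val = problem.get(stat, 0.0)
        if base_val > 0 and cur_val / base_val > threshold:
            flagged[stat] = round(cur_val / base_val, 1)
    return flagged

good_month_end = {"redo size": 250_000.0, "physical reads": 120.0, "user calls": 900.0}
bad_month_end  = {"redo size": 260_000.0, "physical reads": 4_800.0, "user calls": 950.0}

print(changed_stats(good_month_end, bad_month_end))  # → {'physical reads': 40.0}
```

Here redo and user calls barely moved, so the workload is comparable, but physical reads are up 40x - exactly the sort of shift that narrows the investigation quickly.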




What did I learn?


Anyway, in his presentation Jonathan requested Statspack reports from the audience but was only given a single report (from when the performance was poor). There was a real element of live performance here, a little bit of pressure for the presenter, but JL is extremely comfortable and adept with Statspack analysis. The key things I learnt from watching him go through several reports in a "rapid fire" format were:


1) Broader range of stats for initial analysis


a) Load Profile


He went through the "Load Profile" in more detail, for example:


  • Redo size (per sec) - how many updates, inserts and deletes are going on in your system? For example, at 890K/sec that would be roughly 3.2GB of redo per hour - is this high for a production db?
  • Block changes (per sec) - 2800/sec; again, is this high for this system?
  • Executions (per sec) - how busy are your CPUs? The number of executions needs to be divided by the number of CPU cores you have, e.g. if you have 4 quad-core CPUs, then the load across each core is only 1/16th of the total
  • ...
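The two back-of-the-envelope sums above can be sketched as follows (the 890K/sec redo rate, execution rate and core counts are the illustrative figures from the bullets, not a real system):

```python
# Sketch of the back-of-the-envelope sums above: redo volume per hour and
# executions per CPU core. Figures match the illustrative bullets.

def redo_gb_per_hour(redo_bytes_per_sec: float) -> float:
    """Redo rate scaled from bytes/sec to (decimal) GB/hour."""
    return redo_bytes_per_sec * 3600 / 1e9

def executions_per_core(execs_per_sec: float, sockets: int, cores_per_socket: int) -> float:
    """Spread the execution rate evenly across all CPU cores."""
    return execs_per_sec / (sockets * cores_per_socket)

# 890K/sec of redo is roughly 3.2GB per hour
print(f"{redo_gb_per_hour(890_000):.1f} GB/hour")  # → "3.2 GB/hour"

# 4 quad-core CPUs: each core sees 1/16th of an illustrative 1,600 execs/sec
print(f"{executions_per_core(1600, 4, 4):.0f} executions/sec per core")  # → "100 executions/sec per core"
```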


Jonathan rapidly went through the load profile, making several good points; after his presentation I went to his blog (http://jonathanlewis.wordpress.com/) and found a link to a good PDF document from Oracle which neatly documents many of the basic analysis points Jonathan was making (this PDF also explains how to drill down and analyse some of the more common bottleneck scenarios).


 

Load Profile 
~~~~~~~~~~~~                            Per Second       Per Transaction 
                                   ---------------       --------------- 
                  Redo size:          1,316,849.03              6,469.71 
              Logical reads:             16,868.21                 82.87 
              Block changes:              5,961.36                 29.29 
             Physical reads:                  7.51                  0.04 
            Physical writes:              1,044.74                  5.13 
                 User calls:              8,432.99                 41.43 
                     Parses:              1,952.99                  9.60 
                Hard parses:                  0.01                  0.00 
                      Sorts:                  1.44                  0.01 
                     Logons:                  0.05                  0.00 
                   Executes:              1,954.97                  9.60 
               Transactions:                203.54 

  % Blocks changed per Read:   35.34    Recursive Call %:               25.90 
 Rollback per transaction %:    9.55       Rows per Sort:              137.38 

 


Observations: 

• This system is generating a lot of redo (over 1MB/s), with 35% of all blocks read being updated. 

• Comparing the number of Physical reads per second to the number of Physical writes per second shows the physical read to physical write ratio is very low (1:139). Typical OLTP systems have a read-to-write ratio of 10:1 or 5:1 - this ratio (at 1:139) is quite unusual. 

• This system is quite busy, with 8,432 User calls per second. 

• The total parse rate (Parses per second) seems to be high, with the Hard parse rate very low, which implies the majority of the parses are soft parses. The high parse rate may tie in with the latch free event, if the latch contended for is the library cache latch, however no assumptions should be made. 

On the whole, this seems to be a heavy workload, with many parses, and writes. 

http://www.oracle.com/technology/deploy/performance/pdf/statspack_opm4.pdf
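As a quick sanity check, a couple of those headline observations can be re-derived from the raw Load Profile figures with a few lines of arithmetic (all values are taken from the excerpt above):

```python
# Sketch: re-derive two of the PDF's observations from the Load Profile
# excerpt above. All figures are taken from that sample report.

redo_per_sec = 1_316_849.03       # bytes per second
physical_reads_per_sec = 7.51
physical_writes_per_sec = 1_044.74

# "generating a lot of redo (over 1MB/s)"
print(f"redo: {redo_per_sec / 1_048_576:.2f} MB/s")

# physical read:write ratio derived from the figures above - unusually
# write-heavy against the 10:1 or 5:1 typical of OLTP systems
ratio = physical_writes_per_sec / physical_reads_per_sec
print(f"read:write ratio = 1:{ratio:.0f}")  # → "read:write ratio = 1:139"
```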


When Jonathan was looking at physical reads per second, he also factored in the "multi block read count" when looking at the stats for "db file scattered read":



Although confusing, a db file sequential read event indicates a single block read is occurring, and a db file scattered read indicates a multiblock read is occurring.  This terminology is industry standard terminology, and can be explained thus: the term scattered relates to the multiple disk blocks being read into scattered (i.e. discontiguous) memory locations.  Similarly, the term sequential indicates the block is being read into a single contiguous memory location (in this case, a single database buffer). 

http://www.oracle.com/technology/deploy/performance/pdf/statspack_opm4.pdf
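One way to sketch that "factor in the multiblock read count" point (the wait rates and db_file_multiblock_read_count value below are illustrative): each "db file scattered read" can cover up to db_file_multiblock_read_count blocks, so a modest scattered-read rate can dominate the actual block throughput:

```python
# Sketch: estimate an upper bound on blocks read per second from the two
# I/O wait events. "db file sequential read" = single-block reads;
# "db file scattered read" = multiblock reads of up to
# db_file_multiblock_read_count blocks each. Figures are illustrative.

def blocks_read_per_sec(sequential_reads: float,
                        scattered_reads: float,
                        multiblock_read_count: int) -> float:
    """Upper bound: one block per sequential read, MBRC blocks per scattered read."""
    return sequential_reads + scattered_reads * multiblock_read_count

# 200 single-block reads/sec plus only 50 scattered reads/sec, with an
# MBRC of 16: the "few" scattered reads account for most of the blocks read.
print(f"{blocks_read_per_sec(200, 50, 16):.0f} blocks/sec")  # → "1000 blocks/sec"
```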





b) Wait Events


Whilst I have been focusing only on the top 5 wait events, there were several other key metrics which he checked (to see the nature of the database processing), e.g. enqueues, smon, pmon. Here he appears to be building up a better picture of the database. 


Also, both Jonathan and Connie Green talked about checking disc and SAN speeds: 



Identify the magnitude of the wait, and if possible, compare Oracle statistics to any relevant OS data (e.g. check whether the average read time for the event db file sequential read, and the OS read time for the disk containing that datafile are consistent, and as expected). 

http://www.oracle.com/technology/deploy/performance/pdf/statspack_opm4.pdf


Actually, issues around slow discs and particularly SANs were a recurring theme across the day, and several of the DBAs in the audience spoke about their concerns here.


This is something I want to look at in more detail in the future.




2) Experience - being able to work through the report much more quickly


One of the things that was so impressive and dramatic was Jonathan's ability to whip through the information. This is where the touch of magic came in. 


However, having been dazed by the sheer finesse of Jonathan's rapid and precise analysis, I think it is important to go back and pull out all the important points he made (hence this blog post). 


It is quite exciting as this very hard and inaccessible area (Oracle Performance Tuning) starts to make sense and become comfortable.



3) Ability to understand and spot more obscure problems


The problem of high leaf node splits, due to an index which contains a field that is continually being changed (e.g. a last updated timestamp), is probably actually a relatively common problem. However there are many other even more obscure problems which we need to be aware of. 


In David Kurtz's presentation, he talked about a "war story" where a customer was experiencing poor periodic performance; careful monitoring showed the performance typically dropping off quite rapidly when they had 90+ users. The performance bottleneck was on high latch waits (latches are low-level internal locks protecting Oracle's shared memory structures, i.e. not locks directly on actual end-user data).

