Understanding vs Resolution – Statistics History Cleanup

Today I helped resolve a problem quickly but to the detriment of my understanding of exactly what was going on and why.

And as part of that quick resolution, I had to destroy the evidence that would have helped me understand it better.

So… now I need to go back and figure it out, if I have time to model it properly, etc.

Here’s what little I know so far.

What happened was that there were a number of application sessions experiencing slowness and contention when executing a recursive stats history cleanup statement.

Verified via ASH that this recursive delete was somehow being called by application ETL code (via the TOP_LEVEL_SQL_ID, PLSQL_ENTRY_OBJECT_ID & USER_ID columns), that four sessions were each running this slow statement, and that, given the nature of the statement below, they were being blocked by mode 6 TX locks from the leading execution.

The version is 11.2.0.3.

The statement was SQL ID 9v9n97qj8z1dg:

delete /*+ dynamic_sampling(4) */ 
from sys.wri$_optstat_histhead_history 
where savtime < :1 
and rownum <= NVL(:2, rownum);
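
For reference, a minimal sketch of the sort of ASH query behind that verification (not the exact statement run at the time):

select sample_time
,      session_id
,      user_id
,      top_level_sql_id
,      plsql_entry_object_id
,      event
,      blocking_session
from   v$active_session_history
where  sql_id = '9v9n97qj8z1dg'
order  by sample_time;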

First up, the resolution was quick and easy and, by a quirk of coincidence, covered by my last post:
https://orastory.wordpress.com/2015/02/16/dbms_stats-purge_stats/

We just got rid of all the statistics history using the magic PURGE_ALL truncate flag.
The history is of limited usefulness day-to-day anyway.
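
That is, broadly, the truncate-based purge from that post:

exec dbms_stats.purge_stats(dbms_stats.purge_all);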

The slowness was holding up ETL jobs which were just trying to calculate stats for their own partition.

I was brought in towards the end of the situation, but here's an illustration of the slowness:

SQL Text
------------------------------
delete /*+ dynamic_sampling(4) */ from sys.wri$_optstat_histhead_history where savtime < :1 and rownum <= NVL(:2, rownum)

Global Information
------------------------------
 Status              :  EXECUTING               
 Instance ID         :  1                       
 SQL ID              :  9v9n97qj8z1dg           
 SQL Execution ID    :  16777249                
 Execution Started   :  02/24/2015 19:11:25     
 First Refresh Time  :  02/24/2015 19:13:06     
 Last Refresh Time   :  02/25/2015 09:10:35     
 Duration            :  50351s                  
 Module/Action       :  JDBC Thin Client/-      
 Program             :  JDBC Thin Client        

Binds
========================================================================================================================
| Name | Position |  Type  |                                           Value                                           |
========================================================================================================================
| :2   |        2 | NUMBER | 10000                                                                                     |
========================================================================================================================

Global Stats
===============================================================================================
| Elapsed |   Cpu   |    IO    | Application | Concurrency |  Other   | Buffer | Read | Read  |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   |  Waits(s)   | Waits(s) |  Gets  | Reqs | Bytes |
===============================================================================================
|   50359 |   35199 |     0.16 |       14669 |         254 |      237 |     1G |   63 | 504KB |
===============================================================================================

SQL Plan Monitoring Details (Plan Hash Value=2348801730)
=======================================================================================================================================================================================================
| Id   |            Operation             |             Name              |  Rows   | Cost  |   Time    | Start  | Execs |   Rows   | Read | Read  | Activity |            Activity Detail            |
|      |                                  |                               | (Estim) |       | Active(s) | Active |       | (Actual) | Reqs | Bytes |   (%)    |              (# samples)              |
=======================================================================================================================================================================================================
|    0 | DELETE STATEMENT                 |                               |         |       |     50262 |   +101 |  1423 |        1 |      |       |          |                                       |
|    1 |   DELETE                         | WRI$_OPTSTAT_HISTHEAD_HISTORY |         |       |     50361 |     +2 |  1423 |        1 |      |       |    29.74 | enq: TX - row lock contention (14630) |
|      |                                  |                               |         |       |           |        |       |          |      |       |          | buffer busy waits (231)               |
|      |                                  |                               |         |       |           |        |       |          |      |       |          | Cpu (82)                              |
|      |                                  |                               |         |       |           |        |       |          |      |       |          | resmgr:cpu quantum (5)                |
|    2 |    COUNT                         |                               |         |       |     50262 |   +101 |  1423 |      14M |      |       |     0.00 | Cpu (1)                               |
|    3 |     FILTER                       |                               |         |       |     50266 |    +97 |  1423 |      14M |      |       |    30.14 | Cpu (15146)                           |
|    4 |      TABLE ACCESS BY INDEX ROWID | WRI$_OPTSTAT_HISTHEAD_HISTORY |      6M | 23218 |     50253 |   +100 |  1423 |       4G |   22 | 176KB |    23.17 | buffer busy waits (18)                |
|      |                                  |                               |         |       |           |        |       |          |      |       |          | Cpu (11627)                           |
| -> 5 |       INDEX RANGE SCAN           | I_WRI$_OPTSTAT_HH_ST          |    990K |  5827 |     50264 |   +101 |  1423 |       4G |   41 | 328KB |    16.94 | buffer busy waits (3)                 |
|      |                                  |                               |         |       |           |        |       |          |      |       |          | Cpu (8332)                            |
|      |                                  |                               |         |       |           |        |       |          |      |       |          | resmgr:cpu quantum (180)              |
=======================================================================================================================================================================================================
   3 - filter(ROWNUM<=NVL(:2,ROWNUM))
   5 - access("WRI$_OPTSTAT_HISTHEAD_HISTORY"."SYS_NC00024$"<SYS_EXTRACT_UTC(:1))

So, my first thoughts were:

1. I thought that MMON was responsible for purging old stats… clearly not in this case as there were multiple concurrent application connections purging old data as part of their ETL process.

2. The DELETE is deleting any history older than a parameterised date, the first 10000 rows thereof. There is no connection to the objects whose statistics the application ETL jobs were working on.

3. I would expect a COUNT STOPKEY operation, but the NVL on the second bind defeats that optimisation (see the sketch after this list).

4. UPDATE: The real time SQL monitoring output is actually reporting the metrics for 1423 executions of this statement. The clue is in the 14M rows reported at the COUNT and FILTER operations: 1423 * 10000 rows ≈ 14M. But I requested the output for what I thought was a single execution id – 16777249 – strange… a bulk/batched delete?
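
On point 3 above, a minimal sketch of the difference, assuming any scratch table T (plan shapes as seen on this 11.2 version):

variable n number
exec :n := 10000

-- Plain bind: the optimiser can stop early, shown as COUNT STOPKEY in the plan
select * from t where rownum <= :n;

-- NVL safety net (null bind meaning "no limit"): the stopkey optimisation is
-- lost and the plan shows COUNT over FILTER(ROWNUM<=NVL(:N,ROWNUM)), as above
select * from t where rownum <= nvl(:n, rownum);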

More questions than answers…

8 Responses to Understanding vs Resolution – Statistics History Cleanup

  1. Pingback: Real time sql monitoring – bulk bound statement | OraStory

  2. >The DELETE is deleting any old history older than a parameterised date, the first 10000 rows thereof.

    Yes, I have seen similar behaviour. Just query V$SQL for similar SQL statements.

    I see deletes from sys.wri$_optstat_histgrm_history by OBJ# and INTCOL#, which I can understand.

    But deletes from
    sys.wri$_optstat_opr
    sys.wri$_optstat_aux_history
    sys.wri$_optstat_ind_history
    sys.wri$_optstat_histhead_history
    sys.wri$_optstat_tab_history
    are simply by SAVTIME and ROWNUM.
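
    For example, a quick sketch to list them:

    select sql_id, sql_text
    from   v$sql
    where  sql_text like 'delete%sys.wri$_optstat%'
    order  by sql_id;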

    • Dom Brooks says:

      Thanks Hemant.
      I’m familiar with these. It was the one above being run by four app processes which was causing me problems.

      Now I think the batched ROWNUM delete is fine when being done by a single process – MMON.

      The question is why were multiple concurrent app processes running it?

      I traced some sessions in a test DB and I got the inserts into history – it’s obvious that when you change an object’s stats your process is responsible for the recursive work of inserting the history record.
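
      Something along these lines, for example, via DBMS_MONITOR (123/456 standing in for the actual SID and SERIAL# of the session gathering stats):

      exec dbms_monitor.session_trace_enable(session_id => 123, serial_num => 456, waits => true, binds => true);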

      But THAT delete didn’t come up in the trace so I’m missing something.

      And I wonder if that is MMON?

      What happens if I kill MMON and then perhaps have to wait past some time threshold? Will I see this DELETE delegated to application threads doing stats work?
      That is my current theory, awaiting testing.

      If that is what happens, it's a really short-sighted approach.

    • Dom Brooks says:

      I checked AWR in prod and there are no AWR snaps for the 24-hour period coinciding with when this problem happened.

      Good for my theory.

      Bad for the existing database monitoring.

      Going to get some more info before trying this on my VM DB where I can kill MMON at will and then figure out if there's any time threshold involved.

      Problem is that I only have a 12c VM here, so… if I can reproduce similar behaviour then that is a good indicator; if I can't then it's meaningless, as there have been quite a few changes in this area – partitioning, etc.

  3. Pingback: COUNT STOPKEY – NVL | OraStory

  4. Dom Brooks says:

    There is an alert which can be logged in certain circumstances:
    “MMNL absent for %u secs; Foregrounds taking over”

    This is normally received when the database is in restricted mode (see note 567562.1).

    Whilst this isn't valid for my circumstances, it is promising that this exists – “promising” in that this is the sort of code path I'm looking for. It's just that the circumstances might not be fully documented?

    • Dom Brooks says:

      “Starting with 11.1 version, these messages are written to a trace file rather than the alert.log to prevent messages being written to the alert.log repeatedly many times. “
