Is data distribution part of your performance checklist?

I’ve been meaning to blog about this for a while, but it’s really hard these days to find interesting things to blog about which have the right balance of words vs technical content. Apart from anything else, it’s hard to get the real-world evidence anonymised and out of the workplace, and I just don’t have the time to model problems from scratch for blog purposes – and few people seem to read blogs these days…

So there’s some performance problem that has been reported and you start digging. Probably, like me, you have a few usual suspects you will immediately turn to.

But let’s start with what I definitely do NOT start with:

  • Lists of objects with stale statistics
  • Lists of indexes which some script off the internet has determined need rebuilding
  • AWR reports

We need EVIDENCE. Evidence of the problem. Appropriately scoped evidence of the problem. Maybe it’s far too early to look for the WHYs, but certainly we’re on the lookout for WHATs. If something took time, it was actively doing something or actively waiting on something. So let’s look for that something (or somethings).

So what do I start with?

  • Is it still running? If not, when was it running?
    • Sources – GV$SESSION, GV$ACTIVE_SESSION_HISTORY, DBA_HIST_ACTIVE_SESS_HISTORY (a quick sketch follows this list)
  • At least 8 times out of 10 we’re probably talking about a specific SQL statement which is eventually identifiable, so hopefully you have a SQL ID (and if it’s not that, well, that could be the start of another set of blog posts)
  • Assuming that is what it is, what do you go to next?
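As flagged above, here’s a minimal sketch of the “is it still running / when was it running” check against ASH, assuming we already have a SQL ID (:sql_id is just a placeholder bind):

-- Recent activity from in-memory ASH (retention depends on how busy the instance is):
select inst_id, sample_time, session_id, sql_exec_id, session_state, event
from   gv$active_session_history
where  sql_id = :sql_id
order  by sample_time;

-- Further back, the AWR copy of ASH (only one sample in ten is retained):
select instance_number, sample_time, session_id, sql_exec_id, session_state, event
from   dba_hist_active_sess_history
where  sql_id = :sql_id
and    sample_time > systimestamp - interval '7' day
order  by sample_time;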

For me, it’s probably DBA_HIST_SQLSTAT.

I want to compare other executions, if there are any. Depending on how often it executes and how long it runs for, I might aggregate this by AWR snapshot or by day – something like the sketch below.
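A minimal sketch of that aggregation, by day and by plan, assuming a known SQL ID (:sql_id is a placeholder):

select trunc(sn.end_interval_time)  snap_day,
       st.plan_hash_value,
       sum(st.executions_delta)     execs,
       round(sum(st.elapsed_time_delta)/1e6
             /nullif(sum(st.executions_delta),0),2) avg_elapsed_secs,
       round(sum(st.buffer_gets_delta)
             /nullif(sum(st.executions_delta),0))   avg_buffer_gets,
       round(sum(st.rows_processed_delta)
             /nullif(sum(st.executions_delta),0))   avg_rows
from   dba_hist_sqlstat  st
join   dba_hist_snapshot sn
on     sn.snap_id         = st.snap_id
and    sn.dbid            = st.dbid
and    sn.instance_number = st.instance_number
where  st.sql_id = :sql_id
group  by trunc(sn.end_interval_time), st.plan_hash_value
order  by snap_day, st.plan_hash_value;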

What am I looking for?

  • 9 times out of 10 I expect to find a change in execution plan
  • Is there a large variety of very similarly performing plans?
  • Does a particular plan stand out as particularly “bad”?
  • Maybe the optimizer arithmetic makes the choice between plans look mathematically close, but in reality the difference is big
  • Maybe there’s a set of parameters which suits one plan and not another
  • Hopefully, though, you can see evidence in some of the metrics which shows you, not necessarily why something happened, but what contributed to the performance compared to previous executions.
    • For example, you might see that logical IO went through the roof, but there can be a myriad of reasons for that, some of which might be caused by other processes
  • Note that the SQL_ID might not be constant – when it is, it makes this easier
  • But maybe your code uses literals, so every combination of literals produces a different SQL_ID; at least then we can use FORCE_MATCHING_SIGNATURE to identify other executions (see the sketch after this list)
  • Or maybe the SQL is completely dynamic with different subqueries, so maybe we’re mining DBA_HIST_SQLTEXT for certain patterns or features of a particular piece of code
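A minimal sketch of the FORCE_MATCHING_SIGNATURE route – find the other SQL_IDs which share a signature with a statement we already know about (:sql_id is a placeholder):

select distinct st.sql_id, st.plan_hash_value
from   dba_hist_sqlstat st
where  st.force_matching_signature = (select s.force_matching_signature
                                      from   dba_hist_sqlstat s
                                      where  s.sql_id = :sql_id
                                      and    rownum = 1)
and    st.force_matching_signature > 0;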

Then?

  • For me, it’s probably the Real-Time SQL Monitoring (RTSM) report for that SQL – DBMS_SQLTUNE.REPORT_SQL_MONITOR (sketched after this list)
  • What might jump out there?
    • How long has it been running, or how long was it running?
    • Was it parallel? How parallel? Was all the parallelism it wanted to run with granted?
    • Which operations took the longest?
    • Did the optimizer get some of the estimates badly wrong?
    • If it did get some badly wrong, would it have made a difference? e.g.
      • Might it have affected a NESTED LOOPS vs HASH JOIN decision?
      • Might it have caused undersized workareas which spilled to TEMP?
    • If you’re familiar with the application, even the code, does the execution plan make sense to you?
      • Would you have done it differently? Maybe driven from a different table?
    • Where did the broad categories of time go?
      • CPU vs IO vs Application Waits for example
      • How much data did you read both logically and physically?
    • From the activity detail, if available, are there any wait events which stand out as taking all the time?
  • Note that if there are executions in memory, it can be very useful to compare metrics across specific individual executions by looking at GV$SQL_MONITOR
  • Maybe nothing is still in memory and you have to resort to historical information – see DBA_HIST_REPORTS and DBMS_AUTO_REPORT.REPORT_REPOSITORY_DETAIL (also sketched below)
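A minimal sketch of those routes, assuming a known SQL ID (:sql_id and :report_id are placeholders; the type/report_level arguments are just the ones I tend to use, and as far as I recall KEY1/KEY2 in DBA_HIST_REPORTS hold the SQL_ID and SQL_EXEC_ID for sqlmonitor reports):

-- RTSM report for the last monitored execution still in the cursor cache:
select dbms_sqltune.report_sql_monitor(
         sql_id       => :sql_id,
         type         => 'TEXT',
         report_level => 'ALL') rtsm_report
from   dual;

-- Compare individual executions still in memory:
select inst_id, sql_exec_id, sql_exec_start, status,
       round(elapsed_time/1e6,1) elapsed_secs,
       round(cpu_time/1e6,1)     cpu_secs,
       buffer_gets, disk_reads
from   gv$sql_monitor
where  sql_id = :sql_id
and    px_server# is null   -- coordinator/serial rows only, not the PX slave rows
order  by sql_exec_start;

-- Historical RTSM reports persisted to AWR:
select report_id, period_start_time, period_end_time, key1 sql_id, key2 sql_exec_id
from   dba_hist_reports
where  component_name = 'sqlmonitor'
and    key1 = :sql_id
order  by period_start_time;

select dbms_auto_report.report_repository_detail(rid => :report_id, type => 'TEXT') rtsm_report
from   dual;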

So we start to build up a picture of contributing factors, elements which might have been cause, might have been effect, but we start joining up the dots.

As mentioned, very often it’s a simple plan change and you can fix it through your tool of choice – SQL profile? SQL plan baseline? SQL patch? Change the code? (There’s a baseline sketch just below.)
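For example, a minimal sketch of pinning a known good plan with a SQL plan baseline, assuming the good plan is still in the cursor cache (&sql_id and &good_plan_hash_value are SQL*Plus substitution placeholders):

declare
  l_plans pls_integer;
begin
  -- Load the known good plan from the cursor cache as an accepted baseline plan
  l_plans := dbms_spm.load_plans_from_cursor_cache(
               sql_id          => '&sql_id',
               plan_hash_value => &good_plan_hash_value);
  dbms_output.put_line(l_plans || ' plan(s) loaded');
end;
/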

Often I find that changing the code is the best long-term choice, particularly if you look at it and think “well, I wouldn’t have written it like that.” So often a badly performing piece of SQL is just not structured in a natural way.

Anyway, what happens when we find that the execution plan is stable?

We can see that one or some executions definitely took a lot longer.

Perhaps it’s time to start looking at some of the data?

Another great thing about RTSM reports is that they tell you about the binds which were used for those executions (and if the execution is still in memory, you can pull the same binds out of GV$SQL_MONITOR – sketched below).

Maybe you’re using binds and you’re comparing apples to oranges – small date ranges vs large date ranges, etc.
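A minimal sketch of shredding GV$SQL_MONITOR.BINDS_XML – the XMLTABLE paths assume the usual <binds><bind …> structure, so treat it as illustrative rather than definitive:

select m.sql_exec_id,
       b.bind_name,
       b.bind_value
from   gv$sql_monitor m,
       xmltable('/binds/bind'
                passing xmltype(m.binds_xml)
                columns bind_name  varchar2(128)  path '@name',
                        bind_value varchar2(4000) path '.') b
where  m.sql_id = :sql_id
and    m.binds_xml is not null
order  by m.sql_exec_id, b.bind_name;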

Because most of the time the problem has already been identified using one or both of the sources above, I’m usually pretty slow to start digging into the actual parameters passed into the procedures.

At this point, well, it depends. I might start looking at the tables involved and start slicing and dicing the data to see whether there is some ridiculous data skew.

Twice recently I’ve had very notable issues caused not by a problem with the plan, not by anything happening on the database, but by an abuse of the data. Usually it’s other systems sending data in a way they are not meant to, but not necessarily in a way that we can control, and maybe not in a way that can be protected against by some constraint or other. But in a way that causes a serious execution issue.

Here is the RTSM report:


==========================================================================================================================================================
| Id |                           Operation                           |           Name           |  Rows   | Cost |   Time    | Start  | Execs |   Rows   |
|    |                                                               |                          | (Estim) |      | Active(s) | Active |       | (Actual) |
==========================================================================================================================================================
|  0 | SELECT STATEMENT                                              |                          |         |      |      1175 |    +26 |     1 |        0 |
|  1 |   SORT ORDER BY                                               |                          |      17 | 764K |      1175 |    +26 |     1 |        0 |
|  2 |    NESTED LOOPS OUTER                                         |                          |      17 | 764K |      1175 |    +26 |     1 |     202K |
|  3 |     NESTED LOOPS OUTER                                        |                          |      17 | 764K |      1175 |    +26 |     1 |     202K |
|  4 |      VIEW                                                     |                          |      17 | 764K |      1175 |    +26 |     1 |     202K |
|  5 |       VIEW                                                    |                          |      17 | 764K |      1175 |    +26 |     1 |     202K |
|  6 |        WINDOW BUFFER                                          |                          |      17 | 764K |      1178 |    +23 |     1 |     202K |
|  7 |         COUNT STOPKEY                                         |                          |         |      |         3 |    +24 |     1 |       1M |
|  8 |          PX COORDINATOR                                       |                          |         |      |        27 |     +0 |    17 |       1M |
|  9 |           PX SEND QC (ORDER)                                  | :TQ10001                 |      17 | 764K |         3 |    +22 |     8 |       1M |
| 10 |            VIEW                                               |                          |      17 | 764K |         3 |    +22 |     8 |       1M |
| 11 |             SORT ORDER BY STOPKEY                             |                          |      17 | 764K |         4 |    +21 |     8 |       1M |
| 12 |              PX RECEIVE                                       |                          |      17 | 764K |         3 |    +22 |     8 |       2M |
| 13 |               PX SEND RANGE                                   | :TQ10000                 |      17 | 764K |         3 |    +20 |     8 |       2M |
| 14 |                SORT ORDER BY STOPKEY                          |                          |      17 | 764K |        21 |     +2 |     8 |       2M |
| 15 |                 FILTER                                        |                          |         |      |        21 |     +2 |     8 |       2M |
| 16 |                  NESTED LOOPS                                 |                          |      17 | 764K |        21 |     +2 |     8 |       2M |
| 17 |                   PX BLOCK ITERATOR                           |                          |      17 | 764K |        21 |     +2 |     8 |       2M |
| 18 |                    TABLE ACCESS STORAGE FULL                  | ER_A                     |      17 | 764K |        21 |     +2 |   235 |       2M |
| 19 |                   PARTITION HASH ITERATOR                     |                          |       1 |      |        21 |     +2 |    2M |       2M |
| 20 |                    TABLE ACCESS BY GLOBAL INDEX ROWID BATCHED | ER_R                     |       1 |      |        22 |     +1 |    2M |       2M |
| 21 |                     INDEX RANGE SCAN                          | ER_RPK                   |       1 |      |        22 |     +1 |    2M |       2M |
| 22 |      PARTITION HASH ITERATOR                                  |                          |       1 |      |      1175 |    +26 |  202K |     202K |
| 23 |       TABLE ACCESS BY GLOBAL INDEX ROWID                      | DP_S                     |       1 |      |      1175 |    +26 |  202K |     202K |
| 24 |        INDEX UNIQUE SCAN                                      | DP_SPK                   |       1 |      |      1175 |    +26 |  202K |     202K |
| 25 |     VIEW PUSHED PREDICATE                                     |                          |       1 |    2 |      1175 |    +26 |  202K |     202K |
| 26 |      SORT GROUP BY                                            |                          |       1 |    2 |      1175 |    +26 |  202K |     202K |
| 27 |       PARTITION HASH SINGLE                                   |                          |       1 |    2 |      1175 |    +26 |  202K |     273M |
| 28 |        TABLE ACCESS BY GLOBAL INDEX ROWID BATCHED             | DP_A                     |       1 |    2 |      1176 |    +25 |  202K |     273M |
| 29 |         INDEX RANGE SCAN                                      | DP_APK                   |       1 |    1 |      1175 |    +26 |  202K |     273M |
==========================================================================================================================================================



This comes from an event sourcing system – the long-term repository for a variety of events being sent between systems, triggering workflows, processing and reporting, and used to pass events downstream to other controls and consolidation systems.

See line 25: what we’ve got here is a join to an inline view which is trying to offer some protection against “duplicate” events*.

And we can see that in those metrics.

We do an index range scan on line 29, fed by 202K input rows. But the output of those 202K lookups is 273 million rows, which are then deduplicated by the SORT GROUP BY on line 26 back down to 202K.

Ouch. That is where the pain is.

If we delve into the data for the bind parameters in question – with nothing more sophisticated than the sort of query sketched below – we start to find examples of 64K “duplicates”. The SQL deduplication protection mechanism is expecting a handful of “duplicates”*, not many thousands.
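A hypothetical sketch of that slice-and-dice, with made-up column names (event_key, event_date) because the real ones can’t be shown here:

-- Which keys have a silly number of "duplicates" in the date range in question?
select event_key,
       count(*) as events
from   dp_a
where  event_date between :start_date and :end_date
group  by event_key
having count(*) > 100
order  by events desc
fetch first 20 rows only;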

*For technical reasons, these are not duplicates. The upstream SORs have transmitted a very similar event many times in a very short space of time (probably due to a bug). The reporting systems will have generated reports many times because of those events, and so we, the long-term event sourcing system, need to keep them. But the downstream systems try to extract additional meaning from these events, and therein lies the problem for this code.
