Oracle Support – Grrr

Is this an appropriate response?

Others have raised this before – in fact I think I’ve raised the question before – but my initial search on Oracle Support did not turn up an answer so I thought I would raise a question again.

Problem Summary
—————————————————
Question only: Null predicates in DBA_HIST_SQL_PLAN

Problem Description
—————————————————
Everyone knows that the FILTER_PREDICATES and ACCESS_PREDICATES columns in DBA_HIST_SQL_PLAN are not populated.
If I remember correctly, this was originally a bug in 9i and statspack and has been carried forward into AWR.
As a result, any execution plan fetched using DBMS_XPLAN.DISPLAY_AWR has no predicate information
The statement which populates the underlying data has hardcoded NULLS.
Why do these continue to be null through all versions?
Is there any plan to address this?

I even provided the AWR statement with the hardcoded NULLS:

INSERT INTO wrh$_sql_plan sp (snap_id, dbid, sql_id, plan_hash_value, id, operation, options, object_node, object#, object_owner, object_name, object_alias, object_type, optimizer, parent_id, depth, position, search_columns, cost, cardinality, bytes, other_tag, partition_start, partition_stop, partition_id, other, distribution, cpu_cost, io_cost, temp_space, access_predicates, filter_predicates, projection, time, qblock_name, remarks, timestamp, other_xml) SELECT /*+ ordered use_nl(p) PARAM(‘_flush_plan_in_awr_sql’, 1) PARAM(‘_cursor_plan_unparse_enabled’, ‘FALSE’) */ :lah_snap_id lsnid, :dbid dbid, p.sql_id, p.plan_hash_value, p.id, p.operation, p.options, p.object_node, p.object#, p.object_owner, p.object_name, p.object_alias, p.object_type, p.optimizer, p.parent_id, p.depth, p.position, p.search_columns, p.cost, p.cardinality, p.bytes, p.other_tag, p.partition_start, p.partition_stop, p.partition_id, p.other, p.distribution, p.cpu_cost, p.io_cost, p.temp_space, NULL access_predicates, NULL filter_predicates, NULL projection, p.time, p.qblock_name, p.remarks, p.timestamp, p.other_xml FROM x$kewrattrnew atn, x$kewrtsqlplan spn, v$sql_plan p WHERE atn.str1_kewrattr = spn.sqlid_kewrspe AND atn.num1_kewrattr = spn.planhash_kewrspe AND spn.sqlid_kewrspe = p.sql_id AND spn.planhash_kewrspe = p.plan_hash_value AND spn.childaddr_kewrspe = p.child_address

Response:

Please provide 10046 showing the issue

blah, blah, blah
instructions for doing a trace
blah, blah, blah

Might as well have been an automated response or “did you try turning it off and on again?”
RTFQ…

AWR: Was a baselined plan used?

Sometimes a simple question turns out to be harder than expected.

“Can we see if a particular SQL execution in AWR used a baselined plan?”

Initial thoughts might be:

Q: Does DBMS_XPLAN.DISPLAY_AWR tell us this?
A: Apparently not. See below. This question could also be rephrased as two other possible questions:

Q:Isn’t there a column on DBA_HIST_SQLSTAT which tell us this?
A: No. You’d think there should be. There is a SQL_PROFILE column. There isn’t a SQL_PLAN_BASELINE column.
There also isn’t an EXACT_MATCHING_SIGNATURE although there is a FORCE_MATCHING_SIGNATURE.

Q: Is it in DBA_HIST_SQL_PLAN.OTHER_XML?
A. No although this is where DBMS_XPLAN.DISPLAY_AWR gets it’s notes about cardinality feedback and dynamic sampling from.

First of all, Let’s see that it’s not shown in DBMS_XPLAN.DISPLAY_AWR.
Let’s get a sql statement baselined, in AWR and not in memory.

The usual setup

SQL> create table t1
  2  (col1  number
  3  ,col2  varchar2(50)
  4  ,flag  varchar2(2));

Table created.

SQL> insert into t1
  2  select rownum
  3  ,      lpad('X',50,'X')
  4  ,      case when rownum = 1
  5              then 'Y1'
  6              when rownum = 2
  7              then 'Y2'
  8              when mod(rownum,2) = 0
  9              then 'N1'
 10              else 'N2'
 11         end
 12  from   dual
 13* connect by rownum <= 100000
SQL> /

100000 rows created.

SQL> commit;

Commit complete.

Get a plan and put it in a baseline:

SQL> var n varchar2(2);
SQL> exec :n := 'N1';

PL/SQL procedure successfully completed.

SQL> select count(*), max(col2) from t1 where flag = :n;
select * from table(dbms_xplan.display_cursor);
  COUNT(*) MAX(COL2)
---------- --------------------------------------------------
     49999 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

SQL> select * from table(dbms_xplan.display_cursor);


PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
SQL_ID  731b98a8u0knf, child number 0
-------------------------------------
select count(*), max(col2) from t1 where flag = :n

Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |       |       |   221 (100)|          |
|   1 |  SORT AGGREGATE    |      |     1 |    30 |            |          |
|*  2 |   TABLE ACCESS FULL| T1   | 50254 |  1472K|   221   (1)| 00:00:03 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("FLAG"=:N)

Note
-----
   - dynamic sampling used for this statement (level=2)

SQL> declare
  2   l_op pls_integer;
  3  begin
  4    l_op :=
  5    dbms_spm.load_plans_from_cursor_cache('731b98a8u0knf');
  6* end;
SQL> /

PL/SQL procedure successfully completed.

Verify that DBMS_XPLAN.DISPLAY_CURSOR reports baseline usage:

SQL> select count(*), max(col2) from t1 where flag = :n;

  COUNT(*) MAX(COL2)
---------- --------------------------------------------------
     49999 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

SQL> select * from table(dbms_xplan.display_cursor);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
SQL_ID  731b98a8u0knf, child number 0
-------------------------------------
select count(*), max(col2) from t1 where flag = :n

Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |       |       |   221 (100)|          |
|   1 |  SORT AGGREGATE    |      |     1 |    30 |            |          |
|*  2 |   TABLE ACCESS FULL| T1   | 50254 |  1472K|   221   (1)| 00:00:03 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("FLAG"=:N)

Note
-----
   - dynamic sampling used for this statement (level=2)
   - SQL plan baseline SQL_PLAN_13w748wknkcwd616acf47 used for this statement

See that dynamic_sampling note most likely comes from OTHER_XML and Baseline note doesn’t, it comes from V$SQL:

SQL>select sql_id
  2  ,      to_char(exact_matching_signature) sig
  3  ,      plan_hash_value
  4  ,      sql_plan_baseline
  5  from   v$sql
  6* where  sql_id = '731b98a8u0knf'
SQL> /

SQL_ID        SIG                                      PLAN_HASH_VALUE SQL_PLAN_BASELINE
------------- ---------------------------------------- --------------- ------------------------------
731b98a8u0knf 1292784087274697613                           3724264953 SQL_PLAN_13w748wknkcwd616acf47

SQL> set long 10000
SQL> select other_xml
  2  from   v$sql_plan
  3  where  sql_id = '731b98a8u0knf'
  4* and    other_xml is not null
SQL> /

OTHER_XML
--------------------------------------------------------------------------------
<other_xml><info type="db_version">11.2.0.3</info><info type="parse_schema"><![C
DATA["E668983_DBA"]]></info><info type="dynamic_sampling">2</info><info type="pl
an_hash">3724264953</info><info type="plan_hash_2">1634389831</info><peeked_bind
s><bind nam=":N" pos="1" dty="1" csi="178" frm="1" mxl="32">4e31</bind></peeked_
binds><outline_data><hint><![CDATA[IGNORE_OPTIM_EMBEDDED_HINTS]]></hint><hint><!
[CDATA[OPTIMIZER_FEATURES_ENABLE('11.2.0.3')]]></hint><hint><![CDATA[DB_VERSION(
'11.2.0.3')]]></hint><hint><![CDATA[OPT_PARAM('query_rewrite_enabled' 'false')]]
></hint><hint><![CDATA[ALL_ROWS]]></hint><hint><![CDATA[OUTLINE_LEAF(@"SEL$1")]]
></hint><hint><![CDATA[FULL(@"SEL$1" "T1"@"SEL$1")]]></hint></outline_data></oth
er_xml>

Now, let’s add statement to AWR, do an AWR snap and flush the SP:

SQL> exec dbms_workload_repository.add_colored_sql('731b98a8u0knf');

PL/SQL procedure successfully completed.

SQL> exec dbms_workload_repository.create_snapshot;

PL/SQL procedure successfully completed.

SQL> alter system flush shared_pool;

System altered.

Now if we look back at the plan in memory, there’s nothing there.

SQL> select * from v$sql where sql_id = '731b98a8u0knf';

no rows selected

SQL> select * from table(dbms_xplan.display_cursor('731b98a8u0knf'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------
SQL_ID: 731b98a8u0knf, child number: 0 cannot be found

But the SQL is in AWR and let’s see that DBMS_XPLAN.DISPLAY_AWR does not tells us that a baseline is used:

SQL> select * from table(dbms_xplan.display_awr('731b98a8u0knf'));

PLAN_TABLE_OUTPUT
---------------------------------------------------------------
SQL_ID 731b98a8u0knf
--------------------
select count(*), max(col2) from t1 where flag = :n

Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |       |       |   221 (100)|          |
|   1 |  SORT AGGREGATE    |      |     1 |    30 |            |          |
|   2 |   TABLE ACCESS FULL| T1   | 50238 |  1471K|   221   (1)| 00:00:03 |
---------------------------------------------------------------------------

Note
-----
   - dynamic sampling used for this statement (level=2)

But it will tell us if a sql profile is used because it can via DBA_HIST_SQLSTAT.SQL_PROFILE – I leave that exercise to the reader.

So, back to the original question:

“Can we see if a particular SQL execution in AWR used a baselined plan?”

I’ve never needed to do this but as an exercise we can infer this connection by the presence of a baselined plan with the same phv2 but it’s convoluted.

It may be that there are simpler ways but it seems to me that we need to:

  1. Get the exact_matching_signature from the sql text
  2. Get the phv2 out of DBA_HIST_SQL_PLAN.OTHER_XML
  3. Match that phv2 with planid which is not exposed in DBA_SQL_PLAN_BASELINES but is in the underlying SYS.SQLOBJ$ tables
SQL> with subq_mysql as
  2  (select sql_id
  3   ,      (select dbms_sqltune.sqltext_to_signature(ht.sql_text)
  4           from dual) sig
  5   from   dba_hist_sqltext       ht
  6   where  sql_id = '731b98a8u0knf')
  7  ,    subq_baselines as
  8  (select b.signature
  9   ,      b.plan_name
 10   ,      b.accepted
 11   ,      b.created
 12   ,      o.plan_id
 13   from   subq_mysql             ms
 14   ,      dba_sql_plan_baselines b
 15   ,      sys.sqlobj$            o
 16   where  b.signature   = ms.sig
 17   and    o.signature   = b.signature
 18   and    o.name        = b.plan_name)
 19  ,    subq_awr_plans as
 20  (select  sn.snap_id
 21   ,       to_char(sn.end_interval_time,'DD-MON-YYYY HH24:MI') dt
 22   ,       hs.sql_id
 23   ,       hs.plan_hash_value
 24   ,       t.phv2
 25   ,       ms.sig
 26   from    subq_mysql        ms
 27   ,       dba_hist_sqlstat  hs
 28   ,       dba_hist_snapshot sn
 29   ,       dba_hist_sql_plan hp
 30   ,       xmltable('for $i in /other_xml/info
 31                     where $i/@type eq "plan_hash_2"
 32                     return $i'
 33                    passing xmltype(hp.other_xml)
 34                    columns phv2 number path '/') t
 35   where   hs.sql_id          = ms.sql_id
 36   and     sn.snap_id         = hs.snap_id
 37   and     sn.instance_number = hs.instance_number
 38   and     hp.sql_id          = hs.sql_id
 39   and     hp.plan_hash_value = hs.plan_hash_value
 40   and     hp.other_xml      is not null)
 41  select awr.*
 42  ,       nvl((select max('Y')
 43               from   subq_baselines b
 44               where  b.signature = awr.sig
 45               and    b.accepted  = 'YES'),'N') does_baseline_exist
 46  ,      nvl2(b.plan_id,'Y','N') is_baselined_plan
 47  ,      to_char(b.created,'DD-MON-YYYY HH24:MI')  when_baseline_created
 48  from   subq_awr_plans awr
 49  ,      subq_baselines b
 50  where  b.signature (+) = awr.sig
 51  and    b.plan_id   (+) = awr.phv2
 52* order by awr.snap_id
SQL> /

   SNAP_ID DT                SQL_ID        PLAN_HASH_VALUE       PHV2        SIG D I WHEN_BASELINE_CRE
---------- ----------------- ------------- --------------- ---------- ---------- - - -----------------
      8703 05-FEB-2014 15:18 731b98a8u0knf      3724264953 1634389831 1.2928E+18 Y Y 05-FEB-2014 15:08

SQL>

Surprising.
Unless, of course, I’ve overlooked something bleeding obvious.

Plan_hash_value and internal temporary table names

Here’s a little thing about plan_hash_2 that’s come via a pointer from Jonathan Lewis to bug 10162430.

From querying DBA_HIST_SQLSTAT I happened to notice that over a period of 65 days a particular query had 63 different plans (PLAN_HASH_VALUE).

I found that the differences were down done to the different internal temporary table names used for a materialized subquery, i.e. the plans were essentially the same but for the name of SYS_TEMP_0FD9D6786_B6EF87D2, SYS_TEMP_0FD9D6825_BE8671F, SYS_TEMP_0FD9D6684_EAD00827, etc.

Exactly what bug 10162430 addresses is unclear because it is listed as fixed in 11.2.0.3 and my different plan hashes come from an 11.2.0.3 database.

But the key information in the bug description highlighted that PLAN_HASH_2 is not affected by these differing temp table names.

Exactly what plan_hash_2 is and how it differs from plan_hash_value is undocumented but:

  • it is found in V$SQL_PLAN.OTHER_XML;
  • also in the AWR repository DBA_HIST_SQL_PLAN.OTHER_XML;
  • and is used as the enforcing plan hash in sql plan baselines – not exposed in DBA_SQL_PLAN_BASELINES but internally represented in column PLAN_ID in SYS.SQLOBJ$, SYS.SQLOBJ$DATA, SYS.SQLOBJ$AUXDATA.

So, if you regularly look at DBA_HIST_SQLSTAT to look at historic execution statistics of a SQL statement and your queries use materialized subqueries then I recommend adding PLAN_HASH_2 to the output, something like this:

select sn.snap_id
,      sn.end_interval_time
,      st.module
,      st.sql_id
,      st.plan_hash_value
,      (select extractvalue(xmltype(other_xml),'other_xml/info[@type="plan_hash_2"]')
        from   dba_hist_sql_plan hp
        where  hp.sql_id          = st.sql_id
        and    hp.plan_hash_value = st.plan_hash_value
        and    hp.other_xml is not null) plan_hash_2
,      rows_processed_delta rws
,      executions_delta     execs
,      elapsed_time_delta   elp
,      cpu_time_delta       cpu
,      buffer_gets_delta    gets
,      iowait_delta         io
from   dba_hist_snapshot sn
,      dba_hist_sqlstat  st
where  st.snap_id            = sn.snap_id
and    st.sql_id             = '&sql_id'
and    st.elapsed_time_delta > 0
order by sn.snap_id desc; 

And just to show this difference between PHV and PH2:

   SNAP_ID END_INTERVAL_TIME         PLAN_HASH_VALUE PLAN_HASH_2
---------- ------------------------- --------------- ------------
      9131 12/07/2012 00:00               1987570133 3399935153
      9083 11/07/2012 00:00               1567016512 3399935153
      9035 10/07/2012 00:00               3386529786 3399935153
      8891 07/07/2012 00:00               2197008736 3399935153
      8842 05/07/2012 23:30                991904690 3399935153
      8794 04/07/2012 23:30               1331394991 3399935153
      8746 03/07/2012 23:30               2577119670 3399935153
      8699 03/07/2012 00:00               3774955286 3399935153
      8698 02/07/2012 23:30               3774955286 3399935153
      8555 30/06/2012 00:00               3308813978 3399935153
      8507 29/06/2012 00:00                796849390 3399935153
      8459 28/06/2012 00:00                917735929 3399935153
      8410 26/06/2012 23:30                139934889 3399935153
      8363 26/06/2012 00:00               1936665492 3399935153
      8219 23/06/2012 00:00                666306815 3399935153
      8171 22/06/2012 00:00               1053584101 3399935153
      8123 21/06/2012 00:00               1353471101 3399935153
...

What job runs AWR snapshot? Eh?

There are a surprising number of online resources propagating the idea that the collecting of AWR snapshots is somehow connected to the GATHER_STATS_JOB (which was responsible for automatically gathering stats in 10g).

Eh?

Collecting AWR snapshots is one of the responsibilities of the background process MMON.

If you have a problem with the automatic snapshots:
– Check whether you can manually snap using DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT.
– Check MMON and alert log / trace files for related messages.
– See Metalink note 1301503.1: “Troubleshooting: AWR Snapshot Collection issues”

This came up in a question on the OTN forums where the OP said that AWR was not snapping automatically but the GATHER_STATS_JOB seemed to be ok. And the obvious first question would be where did you get that idea from?
Sometimes you can’t win on the forums. If you ask a question, it’s not uncommon to be pointed to a google search. But if you google something, you can’t trust everything that’s out there – it gets outdated and sometimes it never was accurate.

Plans gone AWRy

This is a follow-up to a database performance problem which happened a couple of weeks back, first briefly mentioned in this previous article of mine.

You’ve probably heard it a thousand times before and there’s no significant variance in this case nor revelation in resolution:

  • The Business have reported a performance problem.
  • Some reports which normally take a couple of minutes have been taking closer to one hour.
  • Some DBAs / managers / developers know that when this problem has happened previously, if they flushed the shared pool, the problem mysteriously went away – so that’s what they’ve already done.

The first area of comment is the flushing of the shared pool.

Ok, so it’s a bit of a case of using a sledgehammer to crack a nut but the business have deadlines. With production performance problems and business deadlines, sometimes you’ve got no choice.

Then again, nobody knew it would work – fortunately, from some perspectives, it did work. But, until the root cause is identified and fixed, it’s always going to be uncertain. And “flush the shared pool” is always going to be the battle cry of the panicked, like “reboot your PC”.

Clearly, this is a system that can cope with a flushing of the shared pool. Not true of many.

So, the shared pool has been flushed and we’ve thrown the baby – the evidence & some of the causes – out with the bath water.

However, if you’re appropriately licensed on 10g onwards, you’ve got AWR and ASH to fall back on.

Quickly though, how long should your AWR retention be?

The default is 7 days.
But, as Doug points out in one of his comments, in some organisations by the time you’ve got access five or six days might have passed.

35 and 42 days are oft-mentioned sensible suggestions, mostly so that a) you’ve got a decent time buffer to react and b) you’ve got significant cover to allow for most business cycles of activity.

It’s set to 7 here which is a bit of pain.
Fortunately, I got access to the data within a couple of hours of requesting an account. But even so, I was working on other stuff at the same time and would occasionally think of things to double check but it was too late – the AWR and ASH data was long gone (also I’m writing this article about two weeks after I originally intended to and I wanted to post the queries and data from the actual problem but what with the disruption of the snow and Christmas, the data to match the queries has gone).

Anyway, you’ve been told something has been slow, the problem’s over now, the pool’s flushed, but people want an explanation – there’s an invASHtigation to be done.

In general, when I get access to production, I get data dictionary access. No access to underlying objects, no access to GUIs.

This is part of what I call a blind investigation.
I’m new to the system so I don’t know much about it, I’m not familiar with the SQL or objects, etc.
But I might have looked at some Statspack or AWR reports to get a general picture of activity.

DBA_HIST_ACTIVE_SESS_HISTORY is nearly always my starting place, or V$ACTIVE_SESSION_HISTORY if the data has not been flushed yet, to look for non-standard waiting.

First up though, you need to have asked some basic questions:

  • Is/Was the problem system-wide or limited to a few sessions?
  • If limited, limited to who? Connecting as which user? Connecting via which application?
  • What was the timeframe of the problem

If the problem is system-wide, identifying it from looking at the ASH data for particular sessions might be tricky.

If the database is not the source of the problem, then using AWR and ASH might not give any clues.

It’s not terribly scientific, but paging quickly through and aggregating the ASH data can often give a good “feel” for whether the database was experiencing any non-standard extra time in any particular timeframe (although comparing system-level pictures of activity like statspack and AWR reports might be better).

In this case, I opened up with a query a bit like this, targetting a small timeframe when the problem was almost certainly to have been occurring for the end-user:

select to_char(sample_time,'HH24:MI:SS') time, session_id sid
, session_serial# s#, user_id usr, session_state state
, time_waited waited, blocking_session blk_by, blocking_session_status blk_stat
, sql_id, sql_child_number child_id, sql_plan_hash_value hash
, sql_plan_operation plan_op
, event, p1text, p1, p2text, p2, p3text p3, wait_class
, top_level_sql_id top_id, sql_exec_start ex_start
, plsql_entry_object_id pl_obj, plsql_entry_subprogram_id pl_prog
from   dba_hist_active_sess_history
where sample_time between <start_time> and <end_time>
order by time desc;

Then adding a predicate to run a similar query targetted on the particular user/s experiencing the problem:

and user_id = <user_id>

Followed by zooming in on a particular session if one stands out in the data above:

and session_id = <session_id>
and  session_serial# = <serial#>

With this particular performance problem, it was quickly obvious that this was a case of good plan gone bad with the affected sessions not blocked or waiting but mostly ON CPU running the same SQL_ID and the same SQL_PLAN_HASH_VALUE.

By changing the query to look at counts of a particular hash, it might be possible to find if it’s a particular plan hash that is causing a problem and when it came along:

select sql_plan_hash_value, COUNT(*), MIN(sql_exec_start), MAX(sql_exec_start)
from   dba_hist_active_sess_history
where  sql_id = <sql_id>
group by sql_plan_hash_value;

This query revealed a big skew towards a particular plan which had first appeared in the sampled data at 6 in the morning that day.

Another view which may reinforce your thinking here is DBA_HIST_SQL_STAT. For example, you can get a rough idea of the relative work of one plan over another using a query like this:

select plan_hash_value, optimizer_cost, optimizer_mode, optimizer_env_hash_value, buffer_gets_delta/executions_delta
from   dba_hist_sqlstat
where sql_id = <sql_id>
and   executions_delta > 0;

The plans were gone out of memory but there’s a handy function to pull the plan from AWR:

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_AWR(<sql_id>,<plan_hash_value>,<db_id>,<format>));

Which works really well as long as you don’t hit the display issues you can get with large plans.

If you do hit these display issues then you might prefer to pull your plans from dba_hist_sql_plan for example using a variation of Tanel’s script here.

select
    id      xms_id,
    lpad(' ',depth*1,' ')||operation || ' ' || options xms_plan_step, 
    object_name     xms_object_name,
    cost        xms_opt_cost,
    cardinality xms_opt_card,
    bytes       xms_opt_bytes,
    optimizer   xms_optimizer
from 
    dba_hist_sql_plan 
where 
    plan_hash_value = <plan_hash>
order by id

One big, big downside of pulling the plan from AWR and not from v$sql_plan is that the access_predicates and filter_predicates columns are not populated. The columns are there but the information has not been copied across. A big oversight in the implementation surely.

Having got a bad plan and some not-so-bad plans (same sql id, different plan hash values), then we can start to look at the differences and what might have caused them.

I’m having trouble with uploading and/or presenting these plans in a way that works. WordPress won’t let me upload text files, the plans are 3000-odd lines long and the various options that I’ve tried so far do not give a satisfactory result.

Actual plans aside, suffice to say that there are significant differences between good and bad plans with the bad plans generally employing full table scans against a big table rather than index lookups but there can be an awful lot of distractions in a big plan.

So … so far… good plan, bad plan, same sql, obvious differences. Why? Where else can we look?

There’s one more place which might help – DBA_HIST_SQLBIND.
This should have captured the peeked binds used when our bad plan was parsed along with any binds captured at regular intervals. Unfortunately, we can only tie in with the SQL_ID as there’s no sql plan hash value here so we’re really looking at LAST_CAPTURED in conjunction with the dates/times from the other views.

So. In conclusion…

The details are a bit vague and I apologise for this. I took too long before starting this article and ended up with a good illustration of why 7 days is too short for your AWR retention. However, AWR & ASH can be a very good way of identifying the causes of plan changes.

In this case, it seems highly likely that a particular bit of sql got a bad plan.
It might not have been bad for the particular combination of binds which were peeked at at parse time, but at least bad for the majority of uses that day.

There is, sadly, still just an element of theory about the causes of this particular performance problem.
I can see the bad plan. I can see the plans which work better in that the performance is acceptable for everyone (better does not necessarily mean good). I can see the binds which produced the better plans. I can see the binds which produced the band plan. But, as yet, running the query with the binds which produced the bad plan is not producing the bad plan :(.

And that’s the outstanding problem – if you can’t reproduce something at will, how do you know if you’ve fixed something? Can you prove you’ve understood the whole problem? You can’t know. Not for sure.

I hope to figure out the best way to upload these plans in another article.

This particular problem is almost certainly a case of bind variable peeking – arguably where bind variables shouldn’t be used. On a reporting system where the reports typically take some minutes, the overhead of hard parsing shouldn’t really be a consideration – this seems to be a system where the mantra “you must use bind variables” has been applied inappropriately. The reports take in date parameters which might be historic date ranges or might just be the latest month – a case of skew, a case of one plan does not suit all, a case for literals.

Any ideas, thoughts, mistakes spotted, how to post big plans, let me know.

Some other ASH / AWR resources:
Tracking the bind variable – Dion Cho
Doug’s AWR tagged posts
Doug’s ASH tagged posts
Tracking Parameter Changes – Kerry Osborne

Bug in 11g AWR report

Post 11g upgrade, one of the first things I noticed was in the AWR reports.

In the Instance Efficiency Percentages, the “Parse CPU to Parse Elapsd %” was a less than impressive “0.00”. I’ve become accustomed to this being low in reports here, but not that low ;-) .

This has been confirmed as a new bug but the support analyst was not hopeful of it being fixed at any point soon (periods of not months but years and 12g was mooted).

It’s not a big deal. The forumula for this ratio is an unsurprising 100*(parse time cpu / parse time elapsed), figures which can be found elsewhere in AWR (and statspack of course) in the Instance Activity Stats.

So, the figure that should have reported was 85% – still less than impressive.

I’m slightly surprised that something so obvious slipped through the net but then again maybe not if it’s unlikely that more than 1% of Oracle sites are running 11g.

Follow

Get every new post delivered to your Inbox.

Join 72 other followers