Oracle 19c Upgrade: Bug 31602782 – Same Instance Slave Parse Failure Flood Control

It’s almost a universal truth that for any non-trivial application you can expect to hit one or two serious incidents in production post-upgrade, most often due to some combination of workload and/or timing which just didn’t happen in testing beforehand.

In this instance (/these instances), it was the bug 31602782, the patch for which has the above title, the doc id description being “Contention on CURSOR: Pin S wait on X when PQ slave’s execution plan does not match with QC.

This has been around since 12.2 and is first included in 20.1. The doc describes it as

When a parallel query is executed, QC sends parse message to all slaves in all instances…. if the slave could not build the cursor with the same plan as QC and could not join, another slave tries the same and so on until all slaves get exhausted

In practice, what this meant for us was avery large and significant contention on library cache lock and cursor: pin s wait for x affecting all data loads, large parsing backlogs, issues with anything then wanting library cache lock on the same underlying objects – e.g.partition maintenance operations, etc

There were two specific processes involved and had we not got a bug diagnosis when we did, I would have started to rewrite those processes. It is not uncommon that the flows which end up hitting bugs and performance issues have mileage left in them in terms of optimisation. In hindsight, I believe we did hit this problem in UAT but not in this area and in UAT I had rewritten the process in question as it was problematic anyway,

Certainly once this patch was applied, our most urgent post-upgrade issues were largely resolved.

As an aside, I would urge anyone looking at an upgrade to review beforehand the list of Important Recommended One-off Patches which for 19c is Doc Id: 2720807.1

Oracle 19c Upgrade: Query reading way more partitions than it should

Still a work in progress… but despite a query plan which details that partition pruning should be happening, evidence (both physical reads and obvious timing) suggest that an OLTP-style query is reading all table partitions.

There is no evidence to suggest that this was happening before the upgrade (11.2.0.4 to 19.6) and plenty of evidence of it afterwards. Not consistent suggesting some significant triggering event.

Oracle: Groundbreakers Developer Community https://community.oracle.com/tech/developers/discussion/4480886/partition-list-iterator-reading-too-many-partitions/ currently has all the latest information. If I ever get to a conclusion, then I might collate & summarise here.

COMPATIBLE Identifier Intrigue

An observation about a crucial difference whilst testing an upgrade with / without COMPATIBLE.

Upgrade from 11.2.0.4 to 19.6
Initial upgrade testing done with default COMPATIBLE of 19.0.0
Another enviroment was upgraded with COMPATIBLE set to 11.2.0.4.0, rationale being that it allowed for “easy” downgrade should we run into unresolvable issues.

For background reason, please see Mike Dietrich blog on COMPATIBLE: https://mikedietrichde.com/2019/04/17/when-and-how-should-you-change-compatible/
I am not criticising this article at all.
There is an example in the blog about identifier being such a feature impacted by COMPATIBLE.And there are some interesting points in the discussions about the impact of testing.

We had an issue with a view on top of a PIVOT query.
PIVOT is slightly unusual in that the resulting column names are dynamic.

This example is a bit ridiculous… however… Let’s say we had a pivot a bit like this:

select owner, created_year, "January can be a cold month _T"
  from (
         select owner
              , to_number(to_char(created,'YYYY')) created_year
              , to_char(created,'MON') created_month
              , object_type
           from dba_objects
           where owner in ('SYSTEM','SYS')
       )
pivot (
         max(object_type) as type, count(*) as cnt for created_month in
         ('JAN' as "January can be a cold month "
         ,'FEB' as "February can be a bit miser"
         ,'MAR' as "Mad as a Hare"
         ,'APR' as "Showers"
         ,'MAY' as "Can be magnificent"
         ,'JUN' as "Flamin'"
         ,'JUL' as "Peak Summer"
         ,'AUG' as "Often disappointing"
         ,'SEP' as "Always a bonus"
         ,'OCT' AS "Neither here nor there"
         ,'NOV' as "All the best babies born he"
         ,'DEC' as "Christmas? Already?")
       )
order
    by owner
     , created_year;
OWNER                          CREATED_YEAR January can be a co
------------------------------ ------------ -------------------
SYS                                    2014
SYS                                    2015
SYS                                    2016 DIRECTORY
SYS                                    2017
SYS                                    2018 TABLE SUBPARTITION
SYS                                    2019
SYS                                    2020
SYS                                    2021 TABLE PARTITION
SYSTEM                                 2014                    

9 rows selected.

You can see our final column selection (which might be a view definition in the real world ?!?!??!), is relying on the 11g implicit identifier truncation to 30 characters of “January can be a cold month _T”.
Unwise… in hindsight.

So we fix this and release it to our default 19.6 COMPATIBLE database.

select owner, created_year, "January can be a cold month _TYPE"
  from (
         select owner
              , to_number(to_char(created,'YYYY')) created_year
              , to_char(created,'MON') created_month
              , object_type
           from dba_objects
           where owner in ('SYSTEM','SYS')
       )
pivot (
         max(object_type) as type, count(*) as cnt for created_month in
         ('JAN' as "January can be a cold month "
         ,'FEB' as "February can be a bit miser"
         ,'MAR' as "Mad as a Hare"
         ,'APR' as "Showers"
         ,'MAY' as "Can be magnificent"
         ,'JUN' as "Flamin'"
         ,'JUL' as "Peak Summer"
         ,'AUG' as "Often disappointing"
         ,'SEP' as "Always a bonus"
         ,'OCT' AS "Neither here nor there"
         ,'NOV' as "All the best babies born he"
         ,'DEC' as "Christmas? Already?")
       )
order
    by owner
     , created_year;
	 OWNER      CREATED_YEAR January can be a cold m
---------- ------------ -----------------------
SYS                2015
SYS                2016 DIRECTORY
SYS                2017
SYS                2018
SYS                2019
SYS                2020
SYS                2021 VIEW
SYSTEM             2015
SYSTEM             2021 VIEW                   

9 rows selected.

And then we release it to the other compatible = 11.2.0.4 19c database. Well, I’m sure that there aren’t going to be any surprises that it doesn’t bloody work:

select owner, created_year, "January can be a cold month _TYPE"
  from (
         select owner
              , to_number(to_char(created,'YYYY')) created_year
              , to_char(created,'MON') created_month
              , object_type
           from dba_objects
           where owner in ('SYSTEM','SYS')
       )
pivot (
         max(object_type) as type, count(*) as cnt for created_month in
         ('JAN' as "January can be a cold month "
         ,'FEB' as "February can be a bit miser"
         ,'MAR' as "Mad as a Hare"
         ,'APR' as "Showers"
         ,'MAY' as "Can be magnificent"
         ,'JUN' as "Flamin'"
         ,'JUL' as "Peak Summer"
         ,'AUG' as "Often disappointing"
         ,'SEP' as "Always a bonus"
         ,'OCT' AS "Neither here nor there"
         ,'NOV' as "All the best babies born he"
         ,'DEC' as "Christmas? Already?")
       )
order
    by owner
     , created_year;
ORA-00972: identifier is too long
00972. 00000 -  "identifier is too long"
*Cause:    An identifier with more than 30 characters was specified.
*Action:   Specify at most 30 characters.
Error at Line: 128 Column: 29

So what?
We just use the old one right?
No – it doesn’t bloody work either:

select owner, created_year, "January can be a cold month _T"
  from (
         select owner
              , to_number(to_char(created,'YYYY')) created_year
              , to_char(created,'MON') created_month
              , object_type
           from dba_objects
           where owner in ('SYSTEM','SYS')
       )
pivot (
         max(object_type) as type, count(*) as cnt for created_month in
         ('JAN' as "January can be a cold month "
         ,'FEB' as "February can be a bit miser"
         ,'MAR' as "Mad as a Hare"
         ,'APR' as "Showers"
         ,'MAY' as "Can be magnificent"
         ,'JUN' as "Flamin'"
         ,'JUL' as "Peak Summer"
         ,'AUG' as "Often disappointing"
         ,'SEP' as "Always a bonus"
         ,'OCT' AS "Neither here nor there"
         ,'NOV' as "All the best babies born he"
         ,'DEC' as "Christmas? Already?")
       )
order
    by owner
     , created_year;
ORA-00904: "January can be a cold month _T": invalid identifier
00904. 00000 -  "%s: invalid identifier"
*Cause:
*Action:
Error at Line: 100 Column: 29

Fair enough… What should it be then Mr Compatible?
Give us a clue:

select *
  from (
         select owner
              , to_number(to_char(created,'YYYY')) created_year
              , to_char(created,'MON') created_month
              , object_type
           from dba_objects
           where owner in ('SYSTEM','SYS')
       )
pivot (
         max(object_type) as type, count(*) as cnt for created_month in
         ('JAN' as "January can be a cold month "
         ,'FEB' as "February can be a bit miser"
         ,'MAR' as "Mad as a Hare"
         ,'APR' as "Showers"
         ,'MAY' as "Can be magnificent"
         ,'JUN' as "Flamin'"
         ,'JUL' as "Peak Summer"
         ,'AUG' as "Often disappointing"
         ,'SEP' as "Always a bonus"
         ,'OCT' AS "Neither here nor there"
         ,'NOV' as "All the best babies born he"
         ,'DEC' as "Christmas? Already?")
       )
order
    by owner
     , created_year;
OWNER      CREATED_YEAR January can be a cold m
---------- ------------ -----------------------
SYS                2015
SYS                2016 DIRECTORY
SYS                2017
SYS                2018
SYS                2019
SYS                2020 TABLE PARTITION
SYS                2021 VIEW
SYSTEM             2015
SYSTEM             2021 VIEW

Ok. Give me a bit more of a clue please:

create table dom1 as
select *
  from (
         select owner
              , to_number(to_char(created,'YYYY')) created_year
              , to_char(created,'MON') created_month
              , object_type
           from all_objects
           where owner in ('SYSTEM','SYS')
       )
pivot (
         max(object_type) as type for created_month in
         ('JAN' as "January can be a cold month ")
       )
order
    by owner
     , created_year;

Table DOM1 created.
select * from dom1;
OWNER      CREATED_YEAR January can be a cold m
---------- ------------ -----------------------
SYS                2015
SYS                2016
SYS                2017
SYS                2018
SYS                2019
SYS                2021 VIEW
SYSTEM             2015
SYSTEM             2021 VIEW

You’re playing hard to get…

select listagg('"'||column_name||'"',',') within group(order by column_id) from user_tab_columns where table_name = 'DOM1';
LISTAGG('"'||COLUMN_NAME||'"',',')WITHINGROUP(ORDERBYCOLUMN_ID)
-----------------------------------------------------------------
"OWNER","CREATED_YEAR","January can be a cold month _TYPE"
select "OWNER","CREATED_YEAR","January can be a cold month _TYPE" from dom1;
ORA-00972: identifier is too long
00972. 00000 -  "identifier is too long"
*Cause:    An identifier with more than 30 characters was specified.
*Action:   Specify at most 30 characters.
Error at Line: 225 Column: 36</code></pre>
select "OWNER","CREATED_YEAR","January can be a cold month _TYPE"
from (select *
  from (
         select owner
              , to_number(to_char(created,'YYYY')) created_year
              , to_char(created,'MON') created_month
              , object_type
           from all_objects
           where owner in ('SYSTEM','SYS')
       )
pivot (
         max(object_type) as type for created_month in
         ('JAN' as "January can be a cold month ")
       )
order
    by owner;
ORA-00972: identifier is too long
00972. 00000 -  "identifier is too long"
*Cause:    An identifier with more than 30 characters was specified.
*Action:   Specify at most 30 characters.
Error at Line: 227 Column: 31

OK… enough already… I’m not playing these games 🙂

The obvious solution is not to go anywhere near the limit in the first place. This was just a bit of messing around.

Parallel Window Consolidator Calls The Stops

Sharing observations of performance issue arising out of testing 19.6 upgrade from 11.2.0.4 (bug possibly from 12+ looking at the fixes which work).

Sharing this one in particular as it doesn’t seem that common from anecdotal evidence (forums, blogs, support), it doesn’t appear fixed yet, progress with support has been disappointing, however circumstances don’t seem particularly niche.

We’ve had a few issues with a number of parallel queries just “hanging” forever.

Issue seems to be related to WINDOW CONSOLIDATOR and the parallel distribution method.

Haven’t been able to find a good matching bug via Oracle Support (nor happy with the progress of SR with them) but found a good match on a Lothar Flatz blog post which led me to the PQ_DISTRIBUTE_WINDOW as cause.

SQL contains parallel + join to view, view contains analytic
SQL Executes in under a second “normally”.

Without a fix, the SQL will just hang for a long time. The longest I’ve left it is 3 days but given the wait states of the sessions involved, there’s no reason to think it would stop of its own accord.

Jumping ahead to the workarounds which work, any of the following:

  • Turn off fix_control 13345888 via alter session or opt_param
  • Turn off “_adaptive_window_consolidator_enabled” via alter session/system or opt_param
  • Hint a PQ distribution method other than 2. 3 doesn’t work here so 1 is the other option – PQ_DISTRIBUTE_WINDOW(@query_block 1)

Here we see some evidence from just now, running for nearly 30 minutes:

select s.inst_id i,s.sid,s.serial#,s.module
 ,      s.blocking_session blks,s.blocking_instance bi,s.final_blocking_session f_blk_s, s.final_blocking_instance fbi
 ,      CASE WHEN state != 'WAITING' THEN 'WORKING' ELSE 'WAITING' END AS state
 ,      CASE WHEN state != 'WAITING' THEN 'On CPU / runqueue' ELSE event END AS sw_event, s.seconds_in_wait wait
 ,      s.sql_id,s.sql_exec_id,to_char(s.sql_exec_start,'DD-MON-YYYY HH24:MI:SS') exec_start
 from   gv$session    s
 where  sql_id = '5xyrp7v4djagv';

         I        SID    SERIAL# MODULE                     BLKS         BI    F_BLK_S        FBI STATE   SW_EVENT                        WAIT SQL_ID        SQL_EXEC_ID EXEC_START                   
---------- ---------- ---------- -------------------- ---------- ---------- ---------- ---------- ------- ------------------------- ---------- ------------- ----------- -----------------------------
         2       2805      15947 SQL Developer                                                    WAITING PX Deq: Execution Msg           1492 5xyrp7v4djagv    33554434 26-OCT-2020 12:18:59         
         2       3095       3608 SQL Developer                                                    WAITING PX Deq: Execution Msg           1492 5xyrp7v4djagv    33554434 26-OCT-2020 12:18:59         
         2       3367      28066 SQL Developer                                                    WAITING PX Deq: Execution Msg           1492 5xyrp7v4djagv    33554434 26-OCT-2020 12:18:59         
         2       5610       8452 SQL Developer                                                    WAITING PX Deq: Table Q Normal          1492 5xyrp7v4djagv    33554434 26-OCT-2020 12:18:59         
         2       5885      54481 SQL Developer              9672          2       9672          2 WAITING PX Deq: Execute Reply           1492 5xyrp7v4djagv    33554434 26-OCT-2020 12:18:59         
         2       8828      57832 SQL Developer                                                    WAITING PX Deq: Execution Msg           1492 5xyrp7v4djagv    33554434 26-OCT-2020 12:18:59         
         2       9111      37143 SQL Developer                                                    WAITING PX Deq: Execution Msg           1492 5xyrp7v4djagv    33554434 26-OCT-2020 12:18:59         
         2       9383      50792 SQL Developer                                                    WAITING PX Deq: Execution Msg           1492 5xyrp7v4djagv    33554434 26-OCT-2020 12:18:59         
         2       9672      29993 SQL Developer                                                    WAITING PX Deq: Execution Msg           1492 5xyrp7v4djagv    33554434 26-OCT-2020 12:18:59         

Real-time SQL Monitoring report looks per below (note Duration compared to any other progress/metrics)… note that in later versions of Oracle, RTSM has a habit of timing out and reporting DONE(ERROR) even though the SQL is still going:



 Global Information
 ------------------------------
 Status              :  EXECUTING                  
 Instance ID         :  2                          
 SQL ID              :  5xyrp7v4djagv              
 SQL Execution ID    :  33554434                   
 Execution Started   :  10/26/2020 12:18:59        
 First Refresh Time  :  10/26/2020 12:18:59        
 Last Refresh Time   :  10/26/2020 12:18:59        
 Duration            :  1592s                      
 Module/Action       :  SQL Developer/-            
 Service             :  LNTDH8U.UK.DB.COM          
 Program             :  SQL Developer              


Global Stats
=========================================
| Elapsed |   Cpu   |  Other   | Buffer |
| Time(s) | Time(s) | Waits(s) |  Gets  |
=========================================
|    0.00 |    0.00 |     0.00 |      3 |
=========================================

Parallel Execution Details (DOP=4 , Servers Allocated=8)
==========================================================================================
|      Name      | Type  | Server# | Elapsed |   Cpu   |  Other   | Buffer | Wait Events |
|                |       |         | Time(s) | Time(s) | Waits(s) |  Gets  | (sample #)  |
==========================================================================================
| PX Coordinator | QC    |         |    0.00 |    0.00 |     0.00 |      3 |             |
| p000           | Set 1 |       1 |         |         |          |        |             |
| p001           | Set 1 |       2 |         |         |          |        |             |
| p002           | Set 1 |       3 |         |         |          |        |             |
| p003           | Set 1 |       4 |         |         |          |        |             |
| p004           | Set 2 |       1 |         |         |          |        |             |
| p005           | Set 2 |       2 |         |         |          |        |             |
| p006           | Set 2 |       3 |         |         |          |        |             |
| p007           | Set 2 |       4 |         |         |          |        |             |
==========================================================================================

SQL Plan Monitoring Details (Plan Hash Value=3844894891)
===============================================================================================================================================================================================
| Id   |                           Operation                           |              Name              |  Rows   | Cost |   Time    | Start  | Execs |   Rows   | Activity | Activity Detail |
|      |                                                               |                                | (Estim) |      | Active(s) | Active |       | (Actual) |   (%)    |   (# samples)   |
===============================================================================================================================================================================================
|    0 | SELECT STATEMENT                                              |                                |         |      |           |        |     1 |          |          |                 |
| -> 1 |   PX COORDINATOR                                              |                                |         |      |         1 |     +0 |     1 |        0 |   100.00 | Cpu (1)         |
|    2 |    PX SEND QC (RANDOM)                                        | :TQ20006                       |      2M |  445 |           |        |       |          |          |                 |
|    3 |     HASH JOIN                                                 |                                |      2M |  445 |           |        |       |          |          |                 |
|    4 |      PART JOIN FILTER CREATE                                  | :BF0000                        |     232 |  441 |           |        |       |          |          |                 |
|    5 |       PX RECEIVE                                              |                                |     232 |  441 |           |        |       |          |          |                 |
|    6 |        PX SEND BROADCAST                                      | :TQ20005                       |     232 |  441 |           |        |       |          |          |                 |
|    7 |         VIEW                                                  |                                |     232 |  441 |           |        |       |          |          |                 |
|    8 |          WINDOW CONSOLIDATOR BUFFER                           |                                |     232 |  441 |           |        |       |          |          |                 |
|    9 |           PX RECEIVE                                          |                                |     232 |  441 |           |        |       |          |          |                 |
|   10 |            PX SEND HASH                                       | :TQ20004                       |     232 |  441 |           |        |       |          |          |                 |
|   11 |             WINDOW BUFFER                                     |                                |     232 |  441 |           |        |       |          |          |                 |
|   12 |              HASH JOIN                                        |                                |     232 |  441 |           |        |       |          |          |                 |
|   13 |               NESTED LOOPS                                    |                                |     505 |    4 |           |        |       |          |          |                 |
|   14 |                TABLE ACCESS BY GLOBAL INDEX ROWID             | D_B                            |       1 |    3 |           |        |       |          |          |                 |
|   15 |                 PX RECEIVE                                    |                                |       1 |    2 |           |        |       |          |          |                 |
|   16 |                  PX SEND HASH (BLOCK ADDRESS)                 | :TQ20002                       |       1 |    2 |           |        |       |          |          |                 |
|   17 |                   PX SELECTOR                                 |                                |         |      |           |        |       |          |          |                 |
|   18 |                    INDEX UNIQUE SCAN                          | D_B_PK                         |       1 |    2 |           |        |       |          |          |                 |
|   19 |                INDEX FULL SCAN                                | D_R_UK01                       |     505 |    1 |           |        |       |          |          |                 |
|   20 |               PX RECEIVE                                      |                                |     108 |  437 |           |        |       |          |          |                 |
|   21 |                PX SEND BROADCAST                              | :TQ20003                       |     108 |  437 |           |        |       |          |          |                 |
|   22 |                 VIEW                                          | V_V2                           |     108 |  437 |           |        |       |          |          |                 |
|   23 |                  WINDOW CONSOLIDATOR BUFFER                   |                                |     108 |  437 |           |        |       |          |          |                 |
|   24 |                   PX RECEIVE                                  |                                |     108 |  437 |           |        |       |          |          |                 |
|   25 |                    PX SEND HASH                               | :TQ20001                       |     108 |  437 |           |        |       |          |          |                 |
|   26 |                     WINDOW SORT                               |                                |     108 |  437 |           |        |       |          |          |                 |
|   27 |                      NESTED LOOPS                             |                                |     108 |  436 |           |        |       |          |          |                 |
|   28 |                       NESTED LOOPS                            |                                |     108 |    4 |           |        |       |          |          |                 |
|   29 |                        TABLE ACCESS BY GLOBAL INDEX ROWID     | D_B                            |       1 |    3 |           |        |       |          |          |                 |
|   30 |                         PX RECEIVE                            |                                |       1 |    2 |           |        |       |          |          |                 |
|   31 |                          PX SEND HASH (BLOCK ADDRESS)         | :TQ20000                       |       1 |    2 |           |        |       |          |          |                 |
|   32 |                           PX SELECTOR                         |                                |         |      |           |        |       |          |          |                 |
|   33 |                            INDEX UNIQUE SCAN                  | D_B_PK                         |       1 |    2 |           |        |       |          |          |                 |
|   34 |                        INDEX RANGE SCAN                       | D_BS_UK01                      |     108 |    1 |           |        |       |          |          |                 |
|   35 |                       PX COORDINATOR                          |                                |         |      |           |        |       |          |          |                 |
|   36 |                        PX SEND QC (RANDOM)                    | :TQ10002                       |       1 |    4 |           |        |       |          |          |                 |
|   37 |                         BUFFER SORT                           |                                |      2M |      |           |        |       |          |          |                 |
|   38 |                          VIEW                                 | V_V1                           |       1 |    4 |           |        |       |          |          |                 |
|   39 |                           UNION ALL PUSHED PREDICATE          |                                |         |      |           |        |       |          |          |                 |
|   40 |                            TABLE ACCESS BY GLOBAL INDEX ROWID | D_S                            |       1 |    2 |           |        |       |          |          |                 |
|   41 |                             BUFFER SORT                       |                                |         |      |           |        |       |          |          |                 |
|   42 |                              PX RECEIVE                       |                                |       1 |    1 |           |        |       |          |          |                 |
|   43 |                               PX SEND HASH (BLOCK ADDRESS)    | :TQ10000                       |       1 |    1 |           |        |       |          |          |                 |
|   44 |                                PX SELECTOR                    |                                |         |      |           |        |       |          |          |                 |
|   45 |                                 INDEX UNIQUE SCAN             | D_S_PK                         |       1 |    1 |           |        |       |          |          |                 |
|   46 |                            TABLE ACCESS BY INDEX ROWID        | D_S                            |       1 |    2 |           |        |       |          |          |                 |
|   47 |                             BUFFER SORT                       |                                |         |      |           |        |       |          |          |                 |
|   48 |                              PX RECEIVE                       |                                |       1 |    1 |           |        |       |          |          |                 |
|   49 |                               PX SEND HASH (BLOCK ADDRESS)    | :TQ10001                       |       1 |    1 |           |        |       |          |          |                 |
|   50 |                                PX SELECTOR                    |                                |         |      |           |        |       |          |          |                 |
|   51 |                                 INDEX UNIQUE SCAN             | D_S                            |       1 |    1 |           |        |       |          |          |                 |
|   52 |      PX BLOCK ITERATOR ADAPTIVE                               |                                |    8255 |    3 |           |        |       |          |          |                 |
|   53 |       TABLE ACCESS STORAGE FULL                               | F_FACT                         |    8255 |    3 |           |        |       |          |          |                 |
===============================================================================================================================================================================================

ASH says it’s not doing anything (as we’d expect it to, being consistent with RTSM report above).


select count(*), min(sample_time), max(sample_time), sysdate from gv$active_session_history where sql_id = '5xyrp7v4djagv';

  COUNT(*) MIN(SAMPLE_TIME)             MAX(SAMPLE_TIME)             SYSDATE             
---------- ---------------------------- ---------------------------- --------------------
         1 26-OCT-20 12.18.59.828000000 26-OCT-20 12.18.59.828000000 26-OCT-2020 12:50:29

Miscellaneous Slave Info


select
decode(px.qcinst_id,NULL,'QC',
' - '||lower(substr(pp.SERVER_NAME,
length(pp.SERVER_NAME)-4,4) ) )"Username",
decode(px.qcinst_id,NULL, 'QC', '(Slave)') "QC/Slave" ,
to_char( px.server_set) "SlaveSet",
to_char(s.sid) "SID",
to_char(px.inst_id) "Slave INST",
decode(px.qcinst_id, NULL ,to_char(s.sid) ,px.qcsid) "QC SID",
to_char(px.qcinst_id) "QC INST",
px.req_degree "Req. DOP",
px.degree "Actual DOP"
from gv$px_session px,
gv$session s ,
gv$px_process pp
where px.sid=s.sid (+)
and px.serial#=s.serial#(+)
and px.inst_id = s.inst_id(+)
and px.sid = pp.sid (+)
and px.serial#=pp.serial#(+)
and s.sql_id = '5xyrp7v4djagv'
order by 6 , 1 desc
/

Usernam QC/Slav SlaveSet   SID       Slave INST QC SID     QC INST      Req. DOP  Actual DOP
------- ------- ---------- --------- ---------- ---------- -------- ------------- -----------
QC      QC                 5885       2          5885         
 - p007 (Slave) 2          9672       2          5885       2                   4          4
 - p006 (Slave) 2          3367       2          5885       2                   4          4
 - p005 (Slave) 2          9395       2          5885       2                   4          4
 - p004 (Slave) 2          3095       2          5885       2                   4          4
 - p003 (Slave) 1          9111       2          5885       2                   4          4
 - p002 (Slave) 1          2805       2          5885       2                   4          4
 - p001 (Slave) 1          8828       2          5885       2                   4          4
 - p000 (Slave) 1          5610       2          5885       2                   4          4

Upgrade 11.2.0.3 Days 1 and 2

AKA being knee deep in …

Go-live at the weekend from 9i Solaris to 11.2.0.3 Linux.

There were a couple of minor issues on arrival Monday morning but first critical situation started @ 8:29 am Monday morning and is still ongoing Tuesday evening.

More details later but key points are:
1. Degradation of various durations, longest so far about two hours.
2. Three deliberate bounces, one crash.
3. Unrelated to above, ongoing ORA-00600s at a rate of a couple per hour, ongoing ORA-07445s at a peak rate of about 4 per minute.

Technical considerations from above:
1. Library Cache : Mutex X Contention and change to DBMS_ALERT.REGISTER
If you make significant use of DBMS_ALERT then beware a signature change in 11.2 to DBMS_ALERT.REGISTER which relates to cleanup of orphaned pipes, the default value of which is TRUE.

This new cleanup behaviour, introduced due to some other bug that I don’t have detail of right now, is apparently quite an expensive operation and in practice for us, this has resulted in critical hanging scenarios on “library cache: mutex x” effectively bringing the db to its knees.

The key recursive SQL that seems to be at the heart of this mutex issue is:

SELECT DISTINCT SUBSTR(KGLNAOBJ,11) SID FROM X$KGLOB WHERE KGLHDNSP = 7 AND KGLNAOBJ LIKE 'ORA$ALERT$%' AND BITAND(KGLHDFLG,128)!=0 UNION SELECT DISTINCT SID FROM DBMS_ALERT_INFO

Changing calls to DBMS_ALERT.REGISTER to make an explicit FALSE value for cleanup is currently the advised approach but raises the question of how you should deal with the cleanup that you’d think must be required if it was introduced as default behaviour.

At the moment, I am unconvinced by the declaration that this is expected behaviour of DBMS_ALERT.REGISTER. An expensive operation is not one that causes a massive chain of “library cache:mutex x” waits for 2 hours until the database is effectively down and has to be bounced. That smacks of bug to me.

Observations from the issue – firstly that 11.2.0.3 seems to be reporting mutex waits better than previous versions but whilst all blocking indicators (in ASH, V$SESSION and V$MUTEX_SLEEP_HISTORY) were clear about who they thought was the suspect – i.e. they all identified the same session – that session itself was waiting on the same mutex with a blocking_session of UNKNOWN. And killing that session at the head of the chain just resulted in someone higher up the chain taking their place and everyone else pointing the finger at the new #1.

2. ORA-00600 and ORA-07445s
We have multiple ORA-00600 and ORA-07445 errors, most of which seem to be related to the same area – application contexts and the introduction of the caching of instantiation objects (iobs) for application contexts in order to reduce library cache mutex contention. As mentioned this has caused one crash from PMON which inevitably crashed the instance and a steady stream of application errors. Not all the below are related to the same issue – those that are will “probably” be fixed by the same patch – but this is the current list of ongoing:

ORA-600: [kgldafr-bad-ext]
ORA-600: [kglbrk-bad-lock]
ORA-07445: [kglbrk()+45]
ORA-07445: [kglLockHandle()+23]
ORA-07445: [kksampp()+97]
ORA-07445: [qertbStart()+119]
ORA-07445: [keswxCurPushPlanMonitoring()+510]
ORA-07445: [ktspsrchsc_free()+321]

Just a brief summary for visibility, more details another time…

As I was coming in on Monday morning, I thought that maybe I was being too pessimistic with my expectations and that it might turn out to be a quiet day. Tuns out I wasn’t pessimistic enough. And we spent months testing this…

Fun, fun…

Upgrade update

Finally, my client’s 11g upgrade is approaching.

It feels like it’s been imminent for ages as we slipped from February 2011 to September then through October to November 2011, February 2012, March 10 2012 and latterly to March 25 2012.

On the whole, these slippages have been more to do with affairs of big, corporate IT rather than any issues with the testing of the application under 11g, for example, red tape, higher priority changes and procurement for the whole 11g development estate.

When we actually go live, I’d be surprised if we don’t see some significant issues and oversights in a relatively short timeframe.

However, before then I just want to go through a quick whirlwind of some of the highlights and challenges undertaken/encountered:

Migration from 9.2.0.8 on Solaris to 11.2.0.3 on Linux via an in situ upgrade to 10g on Solaris as an interim step to faciliate use of datapump as opposed to imp/exp
– Change of hardware.
– Change of OS.
– Change of db version.
– We do not use “incremental change”.

Moving from a 20G sga with 17G buffer cache to a 70G sga and a minimum 25G buffer cache – in reality nearer 60G – using sga_target and db_cache_size.

Not using AMM (memory_target) yet not using hugepages!

Async and concurrent IO.

Moving from a statistics strategy of ‘FOR ALL INDEXED COLUMNS SIZE AUTO’ fixed with the two-monthly release cycle to the default stats job (running in a weekend window).

Removal of a weekend maintenance job that rebuilt lots of indexes for undocumented historical reasons.

Application issues:

No (very limited) use of sql plan baselines.

Trial of cursor_sharing = force for modules with poor shared sql principles, abandonment of trial on 11.2.0.2 due to ORA-07445/ORA-00600 from pro*c code.

Manual tuning of any code that exhibited any performance degradation – plenty of that (not surprising given that the oldest comments in the code date back to 1992).

Bugs/Patches applied initially on 11.2.0.2:
11719151 – crippling sql plan management slowness
9842771 – wrong sreadtim, mreadtim statistics
10269193 – wrong results with outer join and case expression
9877980 – issues with cursor_sharing = force

then because of an XSLT bug (10390389) and because of the number of other bugs listed as fixed
11.2.0.3 (which incorporates patches 11719151 and 9877980 previously applied on top of 11.2.0.2)

Test config changes for Veritas CIO and Linux IO Scheduler

Just a short note to report on the impact that some config changes have made on IO times in a specific environment for a specific workload.

I mentioned previously that I’m working on an upgrade of an application from 9.2.0.8 to 11.2.0.2

  • There are changes in pretty much every area – new hardware, different OS, etc, etc.
  • We’re using a full-volume UAT environment within the new set-up to compare new against old (production) and that will form the main basis for performance changes to the application required for this upgrade.
  • It’s pretty much an apples vs oranges comparison and not helped by the fact that UAT runs on tier 2 storage to be compared against the current tier 1 storage – UAT IO is slow.
  • In summary, not exactly best practice upgrade approach – but that’s just how it is sometimes in the real world.

Anyway, anyway, anyway… we’ve been waiting for recommendations and official go-ahead from the database engineering group who run the tests and control the builds of the machines and the following config changes have been made to the following:

  • Veritas VxFS CIO
  • Linux Deadline scheduler

Ideally such changes would be made individually to gauge their individual impact, however, as mentioned, it’s not always like that is it?.

And on the UAT environments above, based on a before-flashback-after run of the main application workload, the following IO times were observed:

Wait Event Average wait time (before) Average wait time (after)
db file sequential read 10 ms 7 ms
db file scattered read 21 ms 11 ms
db file parallel read 38 ms 60 ms
direct path read 95 ms 64 ms
direct path read temp 32 ms 9 ms
direct path write temp 34 ms 8 ms
log file sync 9 ms 3 ms

I’m not convinced that we have a level of control over the whole environment and time to deliver change incrementally to read too much into a single before/after comparison of the same workload, however these initial findings were better than I expected.
(Bottom line is that it’s still apples vs oranges)

So… it may be that someone finds this useful.

Upgrade experiences

I’m planning for an Oracle upgrade, 9i to 11gR2. Cross-platform – Sun Solaris to Linux. Different Endian.

It’s just under 1TB so not big these days.

Downtime shouldn’t be too much of an issue. It’s not like it needs to be done in an hour or anything demanding like that. As long as it can be done within a day.

Within the client, there’s no official build for 11g on Solaris, so I think that might rule out solutions like an in situ upgrade to 11g on Solaris followed immediately by the move & conversion to Linux.

Imp/Exp seems to be the recommended approach given all of the above.

There are lots of resources out there to help with an upgrade.

So, no shortage of advice and guidance.

But I’m also interested in hearing some real world experiences.

So, if you’ve been through something similar, please let me know.

Tell me how you did it and how it went.

11g upgrade still good

Still looking good.

CPU on the new box is hovering under the 20% mark having been habitually > 80% on the old kit. But it could be a quiet time for the business so we will have to compare the metrics that we capture as a proxy for “business activity”.

Best news is that we were getting a significant number of “ORA-01801: date format too long for internal buffer” errors every day causing the client-server application to crash.

These have been due to cursors being shared when they shouldn’t be, for example due to NLS mismatches.

As hoped (and promised to the business), these have disappeared with the upgrade.

we have 11g lift off

Production database was upgraded to 11g over the weekend: 9i -> 11g, SE -> EE, Windows -> Linux.

This is the first application in this client’s company to upgrade.

Seems to have gone well given that I was not called over the weekend.

As ever though, the only true test is going live and I would be very, very surprised if there were no suprises!

So, in the absence of any ringing hotlines and before I start on something new later this morning, I’m just going to be monitoring active session and also seeing what rises to the top by looking at sql with greater than average buffer_gets (I know roughly what’s normally there so will be interesting to see if anything changes).