Upgrade 11.2.0.3 Days 1 and 2

AKA being knee deep in …

Go-live at the weekend from 9i Solaris to 11.2.0.3 Linux.

There were a couple of minor issues on arrival Monday morning, but the first critical situation started at 8:29 am Monday and is still ongoing Tuesday evening.

More details later but key points are:
1. Degradation of various durations, longest so far about two hours.
2. Three deliberate bounces, one crash.
3. Unrelated to the above, ongoing ORA-00600s at a rate of a couple per hour, and ongoing ORA-07445s at a peak rate of about four per minute.

Technical considerations from above:
1. “Library cache: mutex X” contention and a change to DBMS_ALERT.REGISTER
If you make significant use of DBMS_ALERT then beware of a signature change to DBMS_ALERT.REGISTER in 11.2: it gained a cleanup argument, relating to the cleanup of orphaned pipes, whose default value is TRUE.

This new cleanup behaviour, introduced to address some other bug that I don’t have the detail of right now, is apparently quite an expensive operation. In practice for us, it has resulted in critical hanging scenarios on “library cache: mutex X”, effectively bringing the db to its knees.

The key recursive SQL that seems to be at the heart of this mutex issue is:

SELECT DISTINCT SUBSTR(KGLNAOBJ,11) SID
FROM   X$KGLOB
WHERE  KGLHDNSP = 7
AND    KGLNAOBJ LIKE 'ORA$ALERT$%'
AND    BITAND(KGLHDFLG,128) != 0
UNION
SELECT DISTINCT SID FROM DBMS_ALERT_INFO

Changing calls to DBMS_ALERT.REGISTER to pass an explicit FALSE value for cleanup is currently the advised approach, but it raises the question of how you should deal with the cleanup that you’d think must be required if it was introduced as default behaviour.
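As a sketch of the advised change (the alert name here is hypothetical; the cleanup parameter name is as per the 11.2 DBMS_ALERT specification):

```sql
-- Old-style call: in 11.2 this implicitly requests cleanup => TRUE
BEGIN
  DBMS_ALERT.REGISTER('MY_ALERT');
END;
/

-- Advised change: explicitly switch off the orphaned-pipe cleanup
BEGIN
  DBMS_ALERT.REGISTER(name => 'MY_ALERT', cleanup => FALSE);
END;
/
```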

At the moment, I am unconvinced by the declaration that this is expected behaviour of DBMS_ALERT.REGISTER. An expensive operation is not one that causes a massive chain of “library cache: mutex X” waits for two hours until the database is effectively down and has to be bounced. That smacks of a bug to me.

Observations from the issue: firstly, 11.2.0.3 seems to report mutex waits better than previous versions. But whilst all the blocking indicators (in ASH, V$SESSION and V$MUTEX_SLEEP_HISTORY) were clear about who they thought was the suspect – i.e. they all identified the same session – that session itself was waiting on the same mutex with a blocking session of UNKNOWN. And killing that session at the head of the chain just resulted in someone higher up the chain taking their place, with everyone else pointing the finger at the new #1.
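For reference, a minimal sketch of the sort of query used to see the finger-pointing (standard V$SESSION columns):

```sql
-- Who is waiting on the mutex, and who does each session blame?
SELECT sid, event, blocking_session, blocking_session_status
FROM   v$session
WHERE  event = 'library cache: mutex X'
ORDER  BY blocking_session NULLS FIRST;
```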

2. ORA-00600s and ORA-07445s
We have multiple ORA-00600 and ORA-07445 errors, most of which seem to be related to the same area – application contexts and the introduction of the caching of instantiation objects (iobs) for application contexts in order to reduce library cache mutex contention. As mentioned, this has caused one crash from PMON – which inevitably crashed the instance – and a steady stream of application errors. Not all of the errors below are related to the same issue – those that are will “probably” be fixed by the same patch – but this is the current list of ongoing errors:

ORA-600: [kgldafr-bad-ext]
ORA-600: [kglbrk-bad-lock]
ORA-07445: [kglbrk()+45]
ORA-07445: [kglLockHandle()+23]
ORA-07445: [kksampp()+97]
ORA-07445: [qertbStart()+119]
ORA-07445: [keswxCurPushPlanMonitoring()+510]
ORA-07445: [ktspsrchsc_free()+321]

Just a brief summary for visibility, more details another time…

As I was coming in on Monday morning, I thought that maybe I was being too pessimistic with my expectations and that it might turn out to be a quiet day. Turns out I wasn’t pessimistic enough. And we spent months testing this…

Fun, fun…

17 Responses to Upgrade 11.2.0.3 Days 1 and 2

  1. Dom Brooks says:

    Bottom line – patch released, applied, database has since been stable for two weeks.

    Exactly what the patch fixed is unclear – there were definitely fixes to internal code related to the caching of instantiation objects for application contexts.

    To what extent – if at all – the cleanup behaviour of DBMS_ALERT.REGISTER was involved is uncertain, but on Oracle’s recommendation we changed all calls to this to use an explicit FALSE for the clean-up argument, which defaults to TRUE.

    At least one of the ORA-00600s related directly to one of the library cache mutex x hang scenarios but this was not the case for all. In this particular case, the reporting of blocked/blocker was correct. In the other hang scenarios it was not.

    • oakesgr says:

      Hi Dom,

      Did you get any idea from Oracle about how this didn’t crop up during testing? I’m now a very worried DBA, knowing that I’ve got some 9i Solaris -> 11g Linux migrations coming up.

      Cheers
      Graham

      • Dom Brooks says:

        To be honest Graham, if there was one area where we knew we were short of testing, it was a really representative intra-day load. We did not have the infrastructure nor the licence for RAT, nor the space/political will to gather the production trace information to use free RAT-like alternatives.

        However, whilst that first week was a bit hectic, Oracle Support were pretty impressive. Whilst there was also a myriad of different ORA-00600s and 07445s, that one patch has addressed them, everything is now quiet (too quiet, to be honest) and performance is massively up (although we put in a fair bit of effort during testing to achieve that).

        This application makes a lot of use of quite specific functionality – DBMS_ALERT and application contexts – and this one patch addressed issues in both those areas. Certainly, if you use either of those then this might be something to be aware of.

        Bottom line – if we’d been able to fully and accurately simulate the intra-day load then we may have got this in testing. There were plenty of other things that we did pick up on during testing, just not this.

        But it’s big IT and Oracle faces the same challenges that we all do these days, for whatever reason, to create quality software. At some point there will be bugs, it’s inevitable, no matter how much testing you do.

  2. Pingback: Why did that report run slow? ASH says log file sequential read « OraStory

  3. Ed says:

    Hi Dom,

    we’re going from 10.2.0.4 (works just fine) to 11.2.0.3 (working horribly in testing). Like you said, we don’t have RAT and can only get a few users on.

    I’ve not seen mutex waits, but simple commands misbehave: one I tried today, a select count(*) from a table, never finished – yet when I ran it again, having killed the process, it gave the result instantly.

    Can you tell me which patch you applied, please?

    Also, did you bring 9i outlines in to maintain plan stability?

    Thanks,
    Ed.

    • Dom Brooks says:

      Hi Ed,

      Sorry to hear of your troubles.

      I didn’t use outlines for stability, although it is one of the recommended approaches in the whitepapers. One of the main reasons is that I knew this application’s performance in 9i was rubbish anyway, and with a bit of manual tuning to prevent a few degradations, the application has seen a broad 33% performance improvement across the board.

      The patch applied was specific to our issue – the symptoms were quite specific as were the causes so I’d be very reluctant to suggest it in the absence of those.

      In terms of general slowness, I’d suggest taking some AWR, ASH & ADDM reports, keeping the timeframe pretty small, and also mining the raw ASH data (assuming you’re licensed; otherwise Statspack).

      If you have specific queries / examples of slowness then trace them (old-fashioned event 10046 at level 8 at least) or use real time sql monitoring (licence caveat).
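      For the 10046 option, a minimal sketch (level 8 includes wait events; level 12 would add bind values):

      ```sql
      -- Switch on extended SQL trace for the current session
      ALTER SESSION SET events '10046 trace name context forever, level 8';

      -- ... run the problem statement here ...

      -- Switch tracing off again
      ALTER SESSION SET events '10046 trace name context off';
      ```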

      If you want a second opinion on the AWR reports then I’m happy to have a look – just email me at dombrooks@hotmail.com if you’re permitted to do this.

      There may be a general issue, it’s also possible that there are one or two key queries whose performance is impacting others.

      Cheers,
      Dominic

      • Ed says:

        Thanks very much for that, Dominic. Things have improved somewhat. I noticed in ADDM that there was significant virtual memory paging, which seemed unlikely, since I’d given the instance a 6GB SGA and I was the only one on the system.

        I did some delving and it turns out that up to 11.2.0.2 there is a bug with ADDM reporting this and it can be ignored. However, we’re on 11.2.0.3.
        Having eliminated that, I wondered if it might be because we haven’t implemented hugepages. So I asked our Linux admin to do that, and I’ve not seen the problem since.

        The SQL plans seem to have come in for all our reports and it’s actually using them :). However, I had to do ALTER SYSTEM SET “_optimizer_use_feedback”=FALSE; as per Oracle.
        Apparently there’s a bug with cardinality feedback. We were finding that processes were hitting 100% CPU and never finishing.

        Cheers,
        Ed.

      • Dom Brooks says:

        Hi Ed,

        Thanks for the comments.
        I witnessed that ADDM bug on 11.2.0.2 but have not seen it on 11.2.0.3.
        We’re not using hugepages – I lost that argument.

        If you have any more details on the cardinality feedback issues then please tell me more.
        I have seen issues with this and have had to turn it off at statement level for a few statements, and am contemplating doing so for a few more.
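        For reference, turning cardinality feedback off at statement level can be done with an OPT_PARAM hint; a sketch against a hypothetical table:

        ```sql
        -- Disable cardinality feedback for this statement only
        SELECT /*+ OPT_PARAM('_optimizer_use_feedback' 'false') */
               COUNT(*)
        FROM   some_table;
        ```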

        Thanks.

        Dominic

      • Dom Brooks says:

        I see the following 11.2.0.3 bugs:
        Bug 13382280 Stopkey optimization causes error in cardinality feedback values
        Bug 13648166 Cardinality Feedback mismanagement of GROUP BY cardinality

  4. Ed says:

    Dominic,

    just wondered how 11.2 is behaving now? Did you get all the bugs fixed?

    We’ve got one more week of testing and then it’s the go-live.

    Since I last updated, I’ve found that setting optimizer_features_enable back to 10.2.0.4 has helped greatly. In the beginning I’d rather have a stable system than one that uses all the features.
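    In other words (a sketch – session-level first to test, then system-wide; parameter syntax as per the standard Oracle reference):

    ```sql
    -- Try the older optimizer behaviour for one session first
    ALTER SESSION SET optimizer_features_enable = '10.2.0.4';

    -- Then, if happy, set it system-wide (spfile and memory)
    ALTER SYSTEM SET optimizer_features_enable = '10.2.0.4' SCOPE=BOTH;
    ```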

    I’m also keeping compatible at 10.2.0.4 for a while yet.

    We’ve not been able to properly load test, either. We have about 300 concurrent users. Maximum in testing… 30.

    Thanks,

    Ed.

    • Dom Brooks says:

      Three months on, performing very well.
      The first week was a nightmare but once the patch for the identified bug/s was applied everything settled down.

  5. Danny says:

    Just out of curiosity… What was the patchset id?

  6. Pingback: Reducing “library cache: mutex X” concurrency with dbms_shared_pool.markhot() « Julian Dontcheff's Database Blog

  7. Pingback: Using DBMS_ALERT for signaling events between sessions | Ronald's Oracle
