Upgrade 11.2.0.3 Days 1 and 2

AKA being knee deep in …

Go-live at the weekend from 9i Solaris to 11.2.0.3 Linux.

There were a couple of minor issues on arrival Monday morning, but the first critical situation started at 8:29 am Monday and is still ongoing Tuesday evening.

More details later but key points are:
1. Degradation of various durations, longest so far about two hours.
2. Three deliberate bounces, one crash.
3. Unrelated to the above, ongoing ORA-00600s at a rate of a couple per hour and ongoing ORA-07445s at a peak rate of about 4 per minute.

Technical considerations from above:
1. Library Cache : Mutex X Contention and change to DBMS_ALERT.REGISTER
If you make significant use of DBMS_ALERT then beware a signature change to DBMS_ALERT.REGISTER in 11.2: it gains a cleanup parameter, relating to the cleanup of orphaned pipes, whose default value is TRUE.

This new cleanup behaviour, introduced due to some other bug that I don’t have the detail of right now, is apparently quite an expensive operation and, in practice for us, it has resulted in critical hanging scenarios on “library cache: mutex X”, effectively bringing the db to its knees.

The key recursive SQL that seems to be at the heart of this mutex issue is:

SELECT DISTINCT SUBSTR(KGLNAOBJ,11) SID
  FROM X$KGLOB
 WHERE KGLHDNSP = 7
   AND KGLNAOBJ LIKE 'ORA$ALERT$%'
   AND BITAND(KGLHDFLG,128) != 0
UNION
SELECT DISTINCT SID
  FROM DBMS_ALERT_INFO

Changing calls to DBMS_ALERT.REGISTER to pass an explicit FALSE for cleanup is currently the advised approach, but it raises the question of how you should deal with the cleanup that you’d think must be required if it was introduced as default behaviour.
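
For illustration, a minimal sketch of the kind of change involved, assuming the documented 11.2 signature and a made-up alert name:

-- Old call, which now picks up the 11.2 default of cleanup => TRUE:
-- DBMS_ALERT.REGISTER('MY_APP_ALERT');

-- Amended call, opting out of the expensive orphaned-pipe cleanup:
BEGIN
  DBMS_ALERT.REGISTER(name => 'MY_APP_ALERT', cleanup => FALSE);
END;
/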

At the moment, I am unconvinced by the declaration that this is expected behaviour of DBMS_ALERT.REGISTER. An expensive operation is not one that causes a massive chain of “library cache: mutex X” waits for two hours until the database is effectively down and has to be bounced. That smacks of a bug to me.

Observations from the issue: firstly, 11.2.0.3 seems to report mutex waits better than previous versions. But whilst all the blocking indicators (in ASH, V$SESSION and V$MUTEX_SLEEP_HISTORY) were clear about who they thought was the suspect – i.e. they all identified the same session – that session itself was waiting on the same mutex with a blocking session of UNKNOWN. And killing that session at the head of the chain just resulted in someone higher up the chain taking their place and everyone else pointing the finger at the new #1.
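
For reference, a hedged sketch of the kind of query used to eyeball the chain while it was happening (the column choice here is illustrative rather than a verbatim copy of what we ran):

-- Who is waiting on library cache: mutex X, and who do they blame?
SELECT sid,
       serial#,
       event,
       blocking_session_status,
       blocking_session,
       p1 AS mutex_idn,           -- identifier of the mutex being waited on
       seconds_in_wait
  FROM v$session
 WHERE event = 'library cache: mutex X'
 ORDER BY blocking_session NULLS FIRST;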

2. ORA-00600 and ORA-07445s
We have multiple ORA-00600 and ORA-07445 errors, most of which seem to be related to the same area – application contexts and the introduction of the caching of instantiation objects (iobs) for application contexts in order to reduce library cache mutex contention. As mentioned, this has caused one crash from PMON, which inevitably brought down the instance, and a steady stream of application errors. Not all of the errors below are related to the same issue – those that are will “probably” be fixed by the same patch – but this is the current list of ongoing errors:

ORA-600: [kgldafr-bad-ext]
ORA-600: [kglbrk-bad-lock]
ORA-07445: [kglbrk()+45]
ORA-07445: [kglLockHandle()+23]
ORA-07445: [kksampp()+97]
ORA-07445: [qertbStart()+119]
ORA-07445: [keswxCurPushPlanMonitoring()+510]
ORA-07445: [ktspsrchsc_free()+321]

Just a brief summary for visibility, more details another time…

As I was coming in on Monday morning, I thought that maybe I was being too pessimistic with my expectations and that it might turn out to be a quiet day. Turns out I wasn’t pessimistic enough. And we spent months testing this…

Fun, fun…

Upgrade update

Finally, my client’s 11g upgrade is approaching.

It feels like it’s been imminent for ages as we slipped from February 2011 to September then through October to November 2011, February 2012, March 10 2012 and latterly to March 25 2012.

On the whole, these slippages have been more to do with the affairs of big, corporate IT than with any issues in testing the application under 11g: for example, red tape, higher-priority changes and procurement for the whole 11g development estate.

When we actually go live, I’d be surprised if we don’t see some significant issues and oversights in a relatively short timeframe.

However, before then I just want to go through a quick whirlwind of some of the highlights and challenges undertaken/encountered:

Migration from 9.2.0.8 on Solaris to 11.2.0.3 on Linux via an in situ upgrade to 10g on Solaris as an interim step, to facilitate the use of datapump as opposed to imp/exp:
- Change of hardware.
- Change of OS.
- Change of db version.
- We do not use “incremental change”.

Moving from a 20G SGA with a 17G buffer cache to a 70G SGA with a minimum 25G buffer cache – in reality nearer 60G – using sga_target and db_cache_size.
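
A hedged sketch of the relevant parameters using the figures above (purely illustrative – the real spfile obviously contains plenty more than this):

-- sga_target sizes the overall SGA; with it set, an explicit db_cache_size
-- acts as a floor under the buffer cache.
ALTER SYSTEM SET sga_target = 70G SCOPE = SPFILE;
ALTER SYSTEM SET db_cache_size = 25G SCOPE = SPFILE;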

Not using AMM (memory_target) yet not using hugepages!

Async and concurrent IO.

Moving from a statistics strategy of ‘FOR ALL INDEXED COLUMNS SIZE AUTO’, fixed with the two-monthly release cycle, to the default stats job (running in a weekend window).
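
A hedged sketch of the before and after, purely for illustration (the schema name is hypothetical and these are not the actual scripts):

-- Old approach: stats gathered with each release using a fixed method_opt, e.g.
-- DBMS_STATS.GATHER_SCHEMA_STATS(
--   ownname    => 'APP',
--   method_opt => 'FOR ALL INDEXED COLUMNS SIZE AUTO');

-- New approach: let the automatic stats job run in its weekend maintenance
-- window, with the default global preference for histograms.
BEGIN
  DBMS_STATS.SET_GLOBAL_PREFS('METHOD_OPT', 'FOR ALL COLUMNS SIZE AUTO');
END;
/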

Removal of a weekend maintenance job that rebuilt lots of indexes for undocumented historical reasons.

Application issues:

No (or very limited) use of SQL plan baselines.

Trial of cursor_sharing = force for modules with poor shared SQL principles; abandonment of the trial on 11.2.0.2 due to ORA-07445/ORA-00600 errors from Pro*C code.

Manual tuning of any code that exhibited any performance degradation – plenty of that (not surprising given that the oldest comments in the code date back to 1992).

Bugs/Patches applied initially on 11.2.0.2:
11719151 – crippling sql plan management slowness
9842771 – wrong sreadtim, mreadtim statistics
10269193 – wrong results with outer join and case expression
9877980 – issues with cursor_sharing = force

then, because of an XSLT bug (10390389) and because of the number of other bugs listed as fixed, a move to 11.2.0.3 (which incorporates the fixes for patches 11719151 and 9877980 previously applied on top of 11.2.0.2).

Test config changes for Veritas CIO and Linux IO Scheduler

Just a short note to report on the impact that some config changes have made on IO times in a specific environment for a specific workload.

I mentioned previously that I’m working on an upgrade of an application from 9.2.0.8 to 11.2.0.2.

  • There are changes in pretty much every area – new hardware, different OS, etc, etc.
  • We’re using a full-volume UAT environment within the new set-up to compare new against old (production) and that will form the main basis for performance changes to the application required for this upgrade.
  • It’s pretty much an apples vs oranges comparison, not helped by the fact that UAT runs on tier 2 storage while current production is on tier 1 storage – UAT IO is slow.
  • In summary, not exactly best practice upgrade approach – but that’s just how it is sometimes in the real world.

Anyway, anyway, anyway… we’ve been waiting for recommendations and official go-ahead from the database engineering group who run the tests and control the builds of the machines, and the following config changes have now been made:

  • Veritas VxFS concurrent IO (CIO)
  • Linux deadline IO scheduler

Ideally such changes would be made individually to gauge their individual impact. However, as mentioned, it’s not always like that, is it?

And on the UAT environments above, based on a before-flashback-after run of the main application workload, the following IO times were observed:

Wait Event                 Average wait time (before)   Average wait time (after)
db file sequential read    10 ms                        7 ms
db file scattered read     21 ms                        11 ms
db file parallel read      38 ms                        60 ms
direct path read           95 ms                        64 ms
direct path read temp      32 ms                        9 ms
direct path write temp     34 ms                        8 ms
log file sync              9 ms                         3 ms

I’m not convinced that we have enough control over the whole environment, or the time to deliver change incrementally, to read too much into a single before/after comparison of the same workload. However, these initial findings were better than I expected.
(Bottom line is that it’s still apples vs oranges)

So… it may be that someone finds this useful.

Upgrade experiences

I’m planning for an Oracle upgrade, 9i to 11gR2. Cross-platform – Sun Solaris to Linux. Different endianness.

It’s just under 1TB so not big these days.

Downtime shouldn’t be too much of an issue. It’s not like it needs to be done in an hour or anything demanding like that. As long as it can be done within a day.

Within the client, there’s no official build for 11g on Solaris, so I think that might rule out solutions like an in situ upgrade to 11g on Solaris followed immediately by the move & conversion to Linux.

Imp/Exp seems to be the recommended approach given all of the above.

There are lots of resources out there to help with an upgrade.

So, no shortage of advice and guidance.

But I’m also interested in hearing some real world experiences.

So, if you’ve been through something similar, please let me know.

Tell me how you did it and how it went.

11g upgrade still good

Still looking good.

CPU on the new box is hovering under the 20% mark, having been habitually above 80% on the old kit. But it could be a quiet time for the business, so we will have to compare the metrics that we capture as a proxy for “business activity”.

Best news is that we had been getting a significant number of “ORA-01801: date format is too long for internal buffer” errors every day, causing the client-server application to crash.

These had been due to cursors being shared when they shouldn’t have been, for example because of NLS mismatches.

As hoped (and promised to the business), these have disappeared with the upgrade.
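
For what it’s worth, a hedged illustration of the kind of check that shows why 11g builds separate child cursors rather than sharing them (the sql_id is a placeholder, and I’m assuming LANGUAGE_MISMATCH is the column that covers the NLS case):

-- Why does this statement have multiple child cursors?
SELECT child_number,
       language_mismatch,   -- NLS / language settings differ between children
       bind_mismatch
  FROM v$sql_shared_cursor
 WHERE sql_id = 'abcd1234efgh5';   -- placeholder sql_id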

we have 11g lift off

Production database was upgraded to 11g over the weekend: 9i -> 11g, SE -> EE, Windows -> Linux.

This is the first application in this client’s company to upgrade.

Seems to have gone well given that I was not called over the weekend.

As ever though, the only true test is going live and I would be very, very surprised if there were no surprises!

So, in the absence of any ringing hotlines, and before I start on something new later this morning, I’m just going to be monitoring active sessions and also seeing what rises to the top by looking at SQL with greater-than-average buffer_gets (I know roughly what’s normally there, so it will be interesting to see if anything changes).
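
A hedged sketch of the sort of check I mean – ordering the shared pool by logical IO (the top-20 cut-off is arbitrary):

-- Statements doing the most buffer gets since startup.
SELECT *
  FROM (SELECT sql_id,
               executions,
               buffer_gets,
               ROUND(buffer_gets / executions) AS gets_per_exec,
               SUBSTR(sql_text, 1, 60)         AS sql_text_start
          FROM v$sql
         WHERE executions > 0
         ORDER BY buffer_gets DESC)
 WHERE rownum <= 20;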

Upgrading to 11g

I feel very fortunate that my current client is planning to upgrade a significant production system to Oracle 11g very shortly – you can’t get to know something like this properly until you use it in anger and work through the problems in a time-constrained environment. And I get the impression that there are very few production 11g sites out there. So it will be a good experience and it will look good for me.

Some say that 11g is no more than 10gR3; nevertheless, it’s a bold upgrade decision – we’ll be going from 9i Standard Edition to 11g Enterprise Edition and switching from Windows to Linux (with a delayed phase of RAC thrown in).

It’s a bit like going from a Ford Focus to a Ferrari (but I’ll keep my thoughts on the futility of certain aspects of this to the RAC rant).

The upgrade is scheduled for March or April and, so far, things have looked relatively pain free. In terms of general performance, not unexpectedly, Oracle 11g on Linux seems significantly faster than 9i on Windows.

The changes we have had to make so far mostly concern changes in XML behaviour (changes that actually mostly happened in 10gR2, but we’ve skipped that generation).
For example:

  • In 9i, if you tried to create an XML element with a space in the tag name using XMLELEMENT, the space would be replaced with “_x0020_”. In 11g, the space is preserved as a space. I’m not convinced by this change in behaviour as an element surely cannot have a space in its name. Anyway, in those unwise places where spaces have been used, these tags will have to change in both our client and database code.
  • There has been a change in behaviour regarding dates within XMLELEMENTs. In 9i, a date without an explicit format would pick up the default NLS_DATE_FORMAT of the session. In 11g, the date format will be YYYY-MM-DD. So, to preserve our 9i behaviour, we need to TO_CHAR our dates in XML so that they pick up the session format (see the sketch after this list).
  • When using XMLFOREST in 9i, empty columns would result in no tag being present in the resultant XML. Although deliberate, I think there’s something dodgy about this behaviour anyway. However, it’s changed slightly now we’re in 11g. Not null columns will always produce a tag, even if empty (consider outer joins to tables with not null columns). Empty nullable columns continue to have their tags omitted. I’m not sure how we will address this in our particular situation (empty tags cause us issues if the client entity is not a nullable object, but if the tag is missing apparently the whole issue is avoided. Somehow!). The solution certainly won’t involve changing not null columns to nullable but, to limit changes in both client and database, it could involve using XSL to strip out empty tags.
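
For illustration, a hedged sketch of the date workaround (the table and column names are hypothetical):

-- 9i: the date implicitly picked up the session NLS_DATE_FORMAT.
-- 11g: an unformatted date comes out as YYYY-MM-DD, so we format it
-- explicitly to keep the old output.
SELECT XMLELEMENT("TradeDate",
                  TO_CHAR(t.trade_date)   -- uses the session NLS_DATE_FORMAT
       ) AS trade_date_xml
  FROM trades t;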

I anticipate further issues will come out in the wash; however, I expect the majority of these to only reveal themselves after the production upgrade. There are bound to be some statements that degrade significantly, but it might not be until the business does its daily thing, day-in day-out, that the most significant problems for them float to the surface (as opposed to the things that are most obvious from automated load testing, the efficacy of which I am always doubtful about).

Fighting off RAC

As mentioned, my current client is upgrading from 9i Standard Edition on Windows to 11g Enterprise Edition on Linux.

The timeframe for this upgrade is quite aggressive. While normal development continues at a blistering pace, automated regression testing began part-time in January for an upgrade originally in March but now re-arranged for April.

Also in January, I was told that it had been decided that we would be going live on a 2-node RAC cluster. Alarm bells rang, sirens went off, I was bewildered as to how such a decision could be made.

This was a decision that management had apparently made in conjunction with the DBAs, with RAC sold as a black box database clustering solution with no question marks over its suitability.

There’s no architecture department/function and in terms of Oracle and databases, it’s a role I have previously filled elsewhere. I immediately went into overdrive, briefing against the wisdom of going to RAC before it was too late.

For us, there are numerous issues that I see with RAC:

  • The first is the upgrade timeline. The proposal, in January, was that in March we would be changing hardware, changing OS, changing database version, changing database edition and now also moving to RAC. There are plenty of references out there that suggest RAC alone can take a few months to stabilise and to nail down your proper configuration. For a business that was committed to a March (even April now) deadline, this seemed to me to be a big risk.
  • Secondly, I am far from convinced about the suitability of RAC for our system. It’s one thing for senior management to upgrade the business criticality of a system and give the green light for some spending to upgrade the platform. It’s another thing to upgrade your Ford Focus to a Ferrari only to stay stuck in the same commuting traffic. That’s my analogy of our system – lots of bad SQL, lots of statements that cope with 1001 different inputs, far too many lookups in functions called from SQL. At the end of the day there’s no getting away from the fact that our system has something like an average of 8 concurrent requests. Historically, it has been maxed out or close to being maxed out on CPU because the SQL is bad.
  • More importantly, I remain very concerned about the suitability of RAC for the profile of our system. I think of RAC as suitable for high-end OLTP systems, or for multiple systems with the non-OLTP systems restricting themselves to one of the nodes (unless non-conflicting activity can be determined somehow and allocated to relevant nodes). I would probably categorise our single system as a hybrid system with a very small element of OLTP and a large element of reporting. Furthermore, a large proportion of that reporting revolves around the top X% of a single table. The RAC architecture is such that repeated requests for the same blocks will cause significant waits as the instances coordinate and pass the blocks back and forth.
  • The bottom line is that RAC is complex and there’s no getting away from the fact that complexity is bad. Complexity eventually comes back round and bites you. And complexity usually costs more.

For now, my briefing has been a partial success, although really the only point that had any effective leverage was that the upgrade date was under threat. A 2-node RAC solution is currently on hold for the initial upgrade, the idea being that we upgrade to 11g on the new hardware and then expand to a 2-node cluster once that initial upgrade phase is stable.

But, and here’s the sticking point, the upgrade is going to be to a 1-node RAC “cluster”. I can understand that there is a once-in-a-blue-moon opportunity to go to the new hardware, but I don’t like the smell of a 1-node RAC cluster at all. For a start, there surely has to be some overhead to the RAC installation, but we’re not even testing that. And then there’s the cost. There is a significant licensing overhead to RAC – an overhead that surely means that going to a 2-node cluster is a formality regardless. I’m just glad that it’s not my name on the cheque. It’s one of those things where I just hope that someone down the line doesn’t say that we’re moving off Oracle because it’s so expensive.

A couple of other folks’ thoughts on issues with RAC:

Upgrade from 8i to 10g using Shareplex

Plans to upgrade a production database from Oracle 8i to 10g are gradually coming together.

The current situation is that there is a primary 8i database which replicates to a logical standby database using Quest’s Shareplex.

There will be 10g primary and standby databases on new hardware which will probably use Oracle DataGuard to facilitate the standby although Shareplex is not definitely out of the picture.

Because of the new standalone hardware, there is an opportunity to do a relatively simple “upgrade”, bringing the 10g databases through to production readiness whilst replicating production activity from the 8i live database. This is much less risky than upgrading the existing production databases using something like DBUA.

The replication allows testing to be done with “as production” data (subject to any small delay in replication), with accurate volumes and activity and therefore facilitating a good comparison of the output from the applications.

And significantly, the actual go-live becomes a flick of a “switch” (switching application datasources from the 8i address to the 10g address) and, fingers crossed, there is a very small window of downtime. For this to happen, the application drivers need to be upgraded to 10gR2-compatible versions in advance of the database upgrade, drivers being backward, not forward, compatible.

Below is a messy picture:

[Diagram: Oracle upgrade using Shareplex]

What this is trying to say is that, as mentioned above, the current situation is that a live 8i database replicates to a logical standby 8i database. For the migration, a further leg of Shareplex replication will be set up to run from the standby 8i database to the primary 10g destination.

This should mean that the live database is isolated from the additional load and increased risks of the additional replication.

In broad terms, the replication from 8i to 10g involves restoring an 8i backup on the new hardware followed by an export and import into the 10g database and then a reconciliation of the replication queues.

New Year’s Resolutions II

At work, I am resolved to push through the upgrade from Oracle 8.1.7 to 10.2.x.

The client currently has a live 8.1.7 instance on some pretty old hardware that uses Shareplex to replicate to a standby.

Any planned upgrade is going to have some teething problems and I’m sure every installation and every upgrade is unique.

I am encouraged by one case study experience of Database Specialists whilst being wary of some of the problems that others (can’t find the relevant articles at the moment) have experienced.

In terms of application SQL performance, I think that the key to the initial migration is going to be the use of stored outlines to preserve the current (perceived dodgy) optimizer stability. Once the upgrade has been implemented and is stable, then I think we can start to optimise the SQL and, for example, remove some of the push_subq hints from queries that have no subqueries…
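
For what it’s worth, a hedged sketch of the stored outline approach, assuming capture on the current system ahead of the upgrade (the category name is made up):

-- Capture outlines for statements as they are parsed, into a named category.
ALTER SYSTEM SET create_stored_outlines = UPGRADE_817;

-- ... run the representative workload, then stop capturing ...
ALTER SYSTEM SET create_stored_outlines = FALSE;

-- After the upgrade, ask the optimizer to use the captured outlines.
ALTER SYSTEM SET use_stored_outlines = UPGRADE_817;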
