Correlation / causation
May 31, 2012
On Tuesday morning I walked into a right sh*t storm.
I hadn’t been called, but on Monday night performance had been absolutely catastrophic – SLAs missed left, right and centre.
Straight into a meeting where the following observations were made:
- Performance was somewhat suboptimal the previous night
- Some long-running queries were observed that aren’t normally at the top of the lists; these were “uncontrolled” access via Excel or Access, not via the reporting front end or other application processes
- When these queries stopped, performance returned to normal
So:
- Why had these queries brought the database to its knees?
- Was it stats and the stats job on Sunday night?
- Was it “plan flips”? (I hear a lot about “plan flips” and them being “bad” – do people really expect plan stability without plan stability features? A sketch for checking this follows the list.)
- Was it cardinality feedback? (Admittedly this does seem to have caused us more negatives than positives, particularly when working in conjunction with dynamic sampling when I’m not sure it should be.)
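Since “plan flips” come up so often, here’s a minimal sketch of the kind of check that can rule them in or out, against the standard AWR history views (a Diagnostics Pack licence is assumed, and the 7-day window is arbitrary):

-- Any sql_ids that executed with more than one plan across recent AWR snapshots?
select st.sql_id
,      count(distinct st.plan_hash_value) plan_count
from   dba_hist_sqlstat  st
,      dba_hist_snapshot sn
where  sn.snap_id         = st.snap_id
and    sn.dbid            = st.dbid
and    sn.instance_number = st.instance_number
and    sn.begin_interval_time > sysdate - 7
group by st.sql_id
having count(distinct st.plan_hash_value) > 1
order by plan_count desc;

A sql_id showing up here isn’t proof of trouble – plans can legitimately change – but it narrows down where to look.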
At this point, I still hadn’t seen any evidence or symptoms of the issue, just hearsay.
So, what did the problem look like?
Below is a snippet from an AWR report from a “normal” period.
(It’s longer than usually advisable – I just want to show the symptoms over the problem period)
Platform                         CPUs Cores Sockets Memory(GB)
-------------------------------- ---- ----- ------- ----------
Linux x86 64-bit                   24    12       2     141.44

              Snap Id      Snap Time      Sessions Curs/Sess
            --------- ------------------- -------- ---------
Begin Snap:      6678 21-May-12 21:30:26       584      22.8
  End Snap:      6682 21-May-12 23:30:33       565      24.6
   Elapsed:              120.13 (mins)
   DB Time:              582.50 (mins)

Foreground Wait Events
                                                            Avg
                                       %Time Total Wait    wait    Waits   % DB
Event                           Waits  -outs   Time (s)    (ms)     /txn   time
-------------------------- ------------ ----- ---------- ------- -------- ------
db file sequential read       2,664,063     0      7,050       3      2.5   20.2
log file sync                 1,027,185     0      3,501       3      1.0   10.0
direct path read                159,706     0      1,191       7      0.2    3.4
db file scattered read           61,649     0        284       5      0.1     .8

Background Wait Events
                                                            Avg
                                       %Time Total Wait    wait    Waits   % bg
Event                           Waits  -outs   Time (s)    (ms)     /txn   time
-------------------------- ------------ ----- ---------- ------- -------- ------
log file parallel write         839,190     0      1,465       2      0.8   23.6
db file parallel write          551,198     0        674       1      0.5   10.9
Log archive I/O                  21,277     0        311      15      0.0    5.0
log file sequential read          6,714     0        150      22      0.0    2.4
control file parallel writ       14,311     0         48       3      0.0     .8
and this was what had happened:
              Snap Id      Snap Time      Sessions Curs/Sess
            --------- ------------------- -------- ---------
Begin Snap:      7014 28-May-12 21:30:28       560      22.8
  End Snap:      7018 28-May-12 23:30:01       538      23.7
   Elapsed:              119.54 (mins)
   DB Time:            2,067.54 (mins)

Foreground Wait Events
                                                            Avg
                                       %Time Total Wait    wait    Waits   % DB
Event                           Waits  -outs   Time (s)    (ms)     /txn   time
-------------------------- ------------ ----- ---------- ------- -------- ------
log file sync                   399,560     0     83,938     210      1.0   67.7
direct path read                123,631     0     15,639     126      0.3   12.6
db file sequential read         363,158     0      9,406      26      0.9    7.6
db file scattered read           93,140     0      4,083      44      0.2    3.3

Background Wait Events
                                                            Avg
                                       %Time Total Wait    wait    Waits   % bg
Event                           Waits  -outs   Time (s)    (ms)     /txn   time
-------------------------- ------------ ----- ---------- ------- -------- ------
db file parallel write          394,366     0      8,048      20      1.0   37.6
log file parallel write         101,489     0      6,197      61      0.2   28.9
Log archive I/O                  12,026     0      1,814     151      0.0    8.5
control file parallel writ      11,240      0      1,793     160      0.0    8.4
log file sequential read          9,587     0      1,179     123      0.0    5.5
So, hang on a second.
We think that a couple of long-running queries caused this devastation – log file syncs out at 70x our chosen “normal” above, and no CPU issues? All IO basically choked.
Doesn’t make sense – wouldn’t that take some serious misconfiguration?
And yet, how could this be a storage issue if it all resolved itself when those long-running queries finished? It was this doubt that meant we spent longer than needed checking and double-checking this box and this database before contacting the storage team.
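As an aside, those before-and-after averages can be pulled straight from the AWR repository rather than eyeballed from formatted reports. A minimal sketch, using the snap ids from the two reports above (DBA_HIST_SYSTEM_EVENT holds cumulative counters, so this assumes no instance restart within the range):

-- Average wait per event across a snap range, from cumulative AWR counters.
-- Swap 7014/7018 for 6678/6682 to get the "normal" baseline.
-- On RAC, filter on instance_number as well.
select event_name
,      max(total_waits) - min(total_waits) waits
,      round(  ( max(time_waited_micro) - min(time_waited_micro) )
             / nullif(max(total_waits) - min(total_waits), 0)
             / 1000, 1 ) avg_ms
from   dba_hist_system_event
where  snap_id between 7014 and 7018
and    event_name in ('log file sync','db file sequential read','direct path read')
group by event_name
order by avg_ms desc;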
Then to cut a long story short and to miss out a whole bunch of toing and froing and whys and wherefores… the storage team put the focus on another of our databases that uses the same disk groups.
That database didn’t seem to be doing very much:
              Snap Id      Snap Time      Sessions Curs/Sess
            --------- ------------------- -------- ---------
Begin Snap:      8271 28-May-12 21:00:33        71        .9
  End Snap:      8273 28-May-12 23:00:17        72        .9
   Elapsed:              119.74 (mins)
   DB Time:              155.20 (mins)
Although the few things it was doing were similarly slow:
                                                          Avg
                                                         wait   % DB
Event                                Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
control file sequential read        272,003       6,778     25   72.8 System I/O
db file sequential read              36,346       1,409     39   15.1 User I/O
db file parallel read                 1,127         797    707    8.6 User I/O
log file sync                           300         253    843    2.7 Commit
DB CPU                                                72            .8
But, hang on a sec. What’s that:
Time Model Statistics

Statistic Name                                       Time (s) % of DB Time
------------------------------------------ ------------------ ------------
RMAN cpu time (backup/restore)                       42,305.3        454.3
and
Background Wait Events
                                                            Avg
                                       %Time Total Wait    wait    Waits   % bg
Event                           Waits  -outs   Time (s)    (ms)     /txn   time
-------------------------- ------------ ----- ---------- ------- -------- ------
RMAN backup & recovery I/O    1,166,261     0     72,242      62 2.65E+04   60.2
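Had we been looking at that database while the problem was live, something like this sketch against V$RMAN_STATUS would have flagged the backup in flight:

-- Any RMAN operations currently executing on this instance?
select operation
,      status
,      object_type
,      mbytes_processed
,      to_char(start_time,'DD-MON-YYYY HH24:MI') started
from   v$rman_status
where  status like 'RUNNING%'
order by start_time;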
Turns out that it was an RMAN backup, running when it shouldn’t have been, of a 3.5T archive database on one of the other nodes of the cluster – a database which uses the same disk groups.
SQL> select to_char(start_time,'DD-MON-YYYY HH24:MI')
  2  ,      to_char(end_time,'DD-MON-YYYY HH24:MI')
  3  ,      input_bytes_display
  4  ,      output_bytes_display
  5  ,      time_taken_display
  6  from   v$rman_backup_job_details
  7  where  session_stamp = 784494089
  8  order by start_time desc
  9  /

TO_CHAR(START_TIM TO_CHAR(END_TIME, INPUT_BYTES_DISPLAY OUTPUT_BYTES_DISPLAY TIME_TAKEN_DISPLAY
----------------- ----------------- -------------------- -------------------- --------------------
28-MAY-2012 19:01 28-MAY-2012 23:47 2.99T                532.82G              04:46:30

SQL>
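(For what it’s worth, V$RMAN_BACKUP_JOB_DETAILS reports one row per backup job; the SESSION_STAMP predicate above just pins the output to the specific RMAN session in question.)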
That correlation was the causation, not those queries: the backup finished just before those long-running queries did, and then everything was OK.
The long-running queries were just a red herring and another symptom – everything that was running on the live database was choking.
Sometimes it’s difficult to distinguish between cause and symptom.
Hi Dom,
I’ve recently investigated a performance problem which resembles this “correlation/causation” article in the cause that hurt the performance of the system.
It is also very remarkable to see how the average wait times of the db file sequential read and direct path read wait events went from a very good 3 ms and 7 ms in the normal situation to a catastrophic 26 ms and 126 ms respectively during the RMAN backup.
Great warning to all of us Dom. It’s soooo easy to jump to the wrong conclusion just because you see a couple of long-running queries.
Thanks Graham.