Online Statistics Gathering
April 8, 2021
Online Statistics Gathering (https://sqlmaria.com/2017/01/03/online-statistics-gathering/) is a pretty cool feature, introduced way back in the now ages-old Oracle 12.1, and particularly useful for data warehouses.
As a recap, when you do a direct path load into an empty segment, Oracle will gather statistics on the data as it loads it, i.e. there is no need to go back and re-read the data in an explicit stats gathering step afterwards.
Note that, per Maria's article, only base table and column statistics are gathered, not histograms, but those can be gathered afterwards without re-gathering the base column statistics by using the GATHER AUTO option.
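A sketch of what that top-up might look like, per Maria's article, assuming our table is called T1:

-- Sketch: gather only what is missing (e.g. histograms), leaving the base
-- table and column statistics from the online gathering in place.
exec dbms_stats.gather_table_stats(user, 'T1', options => 'GATHER AUTO');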
Old news.
We have only recently upgraded a complex application from 11.2.0.4 and hit an issue which is really obvious in hindsight but just worth pointing out, particularly in combination with another feature.
I’m sure the behaviour below will come as no surprise in 11.2.0.4:
drop table t1;
create global temporary table t1
on commit preserve rows
as
select * from user_objects where rownum <= 100;
explain plan for
select * from t1;
select * from table(dbms_xplan.display);
select num_rows from user_tables where table_name = 'T1';
select column_name, num_nulls, num_distinct from user_tab_columns where table_name = 'T1';
The output of note:
Plan hash value: 3617692013
----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 100 | 19000 | 2 (0)| 00:00:01 |
| 1 | TABLE ACCESS STORAGE FULL | T1 | 100 | 19000 | 2 (0)| 00:00:01 |
----------------------------------------------------------------------------------
Note
-----
- dynamic sampling used for this statement (level=2)
NUM_ROWS
----------
COLUMN_NAME NUM_NULLS NUM_DISTINCT
-------------------- ---------- ------------
GENERATED
SECONDARY
NAMESPACE
EDITION_NAME
OBJECT_NAME
SUBOBJECT_NAME
OBJECT_ID
DATA_OBJECT_ID
OBJECT_TYPE
CREATED
LAST_DDL_TIME
TIMESTAMP
STATUS
TEMPORARY
We load 100 rows, there are no stats, dynamic sampling kicks in and, bingo, the optimizer expects 100 rows.
Then for versions > 11.2.0.4 (in this case 19.6), with the addition of the SCOPE column to the statistics queries for 19c, we see that we now have table and column statistics after the load despite no explicit stats gathering call. Which, given the feature we’re discussing, should be completely obvious and expected.
I’ve used a temporary table here deliberately, purely to facilitate a comment at the end. Clearly this could be a normal heap table – but the SCOPE in the output is specific to the newer feature of temporary table session-private statistics.
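For reference, the amended statistics queries for 19c might look something like this (a sketch – the SCOPE column is exposed in the *_TAB_STATISTICS and *_TAB_COL_STATISTICS views rather than in USER_TABLES / USER_TAB_COLUMNS):

select num_rows, scope from user_tab_statistics where table_name = 'T1';
select column_name, num_nulls, num_distinct, scope from user_tab_col_statistics where table_name = 'T1';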
Plan hash value: 3617692013
----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 100 | 11300 | 2 (0)| 00:00:01 |
| 1 | TABLE ACCESS STORAGE FULL | T1 | 100 | 11300 | 2 (0)| 00:00:01 |
----------------------------------------------------------------------------------
Note
-----
- Global temporary table session private statistics used
NUM_ROWS SCOPE
---------- -------
SHARED
100 SESSION
COLUMN_NAME NUM_NULLS NUM_DISTINCT SCOPE
--------------------- ---------- ------------ -------
OBJECT_NAME 0 7 SESSION
SUBOBJECT_NAME 7 93 SESSION
OBJECT_ID 0 100 SESSION
DATA_OBJECT_ID 3 97 SESSION
OBJECT_TYPE 0 3 SESSION
CREATED 0 95 SESSION
LAST_DDL_TIME 0 97 SESSION
TIMESTAMP 0 96 SESSION
STATUS 0 1 SESSION
TEMPORARY 0 1 SESSION
GENERATED 0 2 SESSION
SECONDARY 0 1 SESSION
NAMESPACE 0 1 SESSION
EDITION_NAME 100 0 SESSION
SHARING 0 1 SESSION
EDITIONABLE 100 0 SESSION
ORACLE_MAINTAINED 0 1 SESSION
APPLICATION 0 1 SESSION
DEFAULT_COLLATION 94 1 SESSION
DUPLICATED 0 1 SESSION
SHARDED 0 1 SESSION
CREATED_APPID 100 0 SESSION
CREATED_VSNID 100 0 SESSION
MODIFIED_APPID 100 0 SESSION
MODIFIED_VSNID 100 0 SESSION
Again, we’ve loaded 100 rows and, as part of the direct path load, we have some base table & column statistics, so the optimizer expects 100 rows. No dynamic sampling necessary.
But what if our data is loaded in multiple steps? Might this new online statistics gathering cause us a bit of a problem?
It only gathers the stats when the segment is empty.
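As a side note, one way to see whether a particular direct path insert actually triggered online statistics gathering is to look at the plan for the DML itself – when the feature kicks in, the plan includes an OPTIMIZER STATISTICS GATHERING row source. A rough sketch, bearing in mind it will only actually gather when the conditions (such as an empty segment) are met:

insert /*+ append */ into t1
select * from user_objects;

-- Plan of the insert just executed in this session; look for an
-- OPTIMIZER STATISTICS GATHERING row source if online stats gathering occurred.
select * from table(dbms_xplan.display_cursor);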
I’m not going to repeat the output above for the table & column statistics, just the execution plan detail.
drop table t1;
create global temporary table t1
on commit preserve rows
as
select * from user_objects where rownum <= 100;
insert /*+ append */ into t1
select * from user_objects;
commit;
insert /*+ append */ into t1
select * from user_objects;
commit;
explain plan for
select * from t1;
Which in 11.2.0.4 gives:
Plan hash value: 3617692013
---------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 32376 | 6007K | 93 (0)| 00:00:01 |
| 1 | TABLE ACCESS STORAGE FULL | T1 | 32376 | 6007K | 93 (0)| 00:00:01 |
---------------------------------------------------------------------------------------
Note
-----
- dynamic sampling used for this statement (level=2)
select count(*) from t1;
COUNT(*)
----------
34392
And in 19.6:
Plan hash value: 3617692013
----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 100 | 11300 | 2 (0)| 00:00:01 |
| 1 | TABLE ACCESS STORAGE FULL | T1 | 100 | 11300 | 2 (0)| 00:00:01 |
----------------------------------------------------------------------------------
Note
-----
- Global temporary table session private statistics used
select count(*) from t1;
COUNT(*)
----------
59712*
*With an obvious difference in actual rows, because these are different environments on different versions.
The problem now is that our subsequent processing code, which drives off a table populated in multiple stages with differing volumes of data, used to work fine in 11.2.0.4 but is now at risk of some serious consequences from cardinality mis-estimates, thanks to statistics which are inaccurate only because our new feature is behaving exactly as it should.
In our real-world example, it was not always the case that there was a big skew between the first load and the subsequent loads, and this consequence, obvious in hindsight, took a few weeks to reveal itself – until the first load into the empty segment happened to be very small and was followed by other loads which were much larger.
Furthermore, this also involved the temporary table session-private statistics aspect, so when you’re looking at solutions you might consider whether some explicit stats gathering at the end of the load is the answer – and, if so, whether it should be private stats or shared stats, or whether that even matters – when quite possibly the easiest solution is just to use dynamic sampling in the subsequent select/s.
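To make those options concrete, a sketch of what each might look like (the procedure call and hint are standard, but whether either is appropriate depends on the application):

-- Option 1 (sketch): explicit stats gathering once all the loads are done.
-- For a GTT this gathers session-private statistics by default in 12c+
-- (DBMS_STATS preference GLOBAL_TEMP_TABLE_STATS = 'SESSION').
exec dbms_stats.gather_table_stats(user, 'T1');

-- Option 2 (sketch): ignore the stale stats and let dynamic sampling
-- do the work in the subsequent selects.
select /*+ dynamic_sampling(t1 2) */ count(*) from t1;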
I was reminded of this issue when writing this other article earlier on a more serious problem that online statistics gathering was causing: