ORM – like we’ve always said

Brian Oliver’s post (http://brianoliver.wordpress.com/2008/11/26/terracotta-chooses-oracle-technology-for-high-availability-and-performance – now removed for some reason) about Terracotta has reminded me that I meant to blog about this product last week.

It’s long-established behaviour that we database specialists do not have the fondest of spots for ORM tools such as Hibernate, given the inefficient way in which they deal with the database and the horrendous SQL that they sometimes generate. (Some of my previous can be found here.)

Last week I came across this article, encouragingly ;-) titled “Hibernate without Database Bottlenecks”.

The fact that products such as Terracotta exist and are becoming more popular is finally sort of proof of what we’ve always said – that the row-by-row processing of Hibernate doesn’t work very well in the database (where it’s important to be thinking in sets).

‘Nuff said.

P.S. If that article requires registration, then to summarise:


One of the most prevalent application architectures today is that of a stateless application that maps object data into the database to be stored in relational format, and Hibernate is the most popular way to perform this object-relational mapping. Applications are designed this way for two reasons. First, scalability at the database server is a known and tunable quantity. Second, availability of the database is much closer to “five nines” than that of the application server. Despite these reasons, the burden that shared Java state places on the database and on the application developer is very high. While Hibernate lessens the developer’s workload in having to interface to a database, Terracotta lessens Hibernate’s need to depend on the database for availability and scalability in the first place. The marriage of Terracotta and Hibernate simplifies application development and greatly improves application performance.

To cut a long story short, there’s an in-memory database for Hibernate to hammer instead. Can people seriously think that this is the solution?

“Simplified” development

Amongst the links that Beth (aka datageekgal) highlights this week is a forthcoming framework from Microsoft to simplify development.

From the article announcing the framework:

The goal of the ADO.Net Entity Framework is to eliminate the impedance mismatch between data models and languages, saving developers from having to deal with these. An example of such a mismatch is objects and relational stores. Developers might write an application that manipulates a CRM system with data for customers stored in 12 different tables…

With the Entity Framework, you can automate the process essentially of bringing all that data together and presenting it to the developer as a single entity so they can interact with it at a higher level of abstraction…

Sounds like Hibernate, iBATIS and other such ORM tools to me.

I’ve got two words for you – Stored Procs and/or Object Relational Views (well, two options, seven words).

That’s an abstraction that works every time.

You just need someone who knows what they’re doing on the database – and I know that’s not trendy in IT at the moment ;-)
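
For illustration – a minimal sketch with invented names, not anything from a real project – the stored proc flavour of that abstraction might take a collection and do the work as a single set operation:


CREATE TYPE customer_ot AS OBJECT (id NUMBER, name VARCHAR2(100));
/
CREATE TYPE customer_nt AS TABLE OF customer_ot;
/
-- One set-based INSERT over the whole collection,
-- rather than N single-row inserts from the application
CREATE OR REPLACE PROCEDURE add_customers (p_customers IN customer_nt)
AS
BEGIN
   INSERT INTO customers (id, name)
   SELECT id, name
   FROM   TABLE(CAST(p_customers AS customer_nt));
END;
/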

Mind you, in the past year I’ve worked on my first two .Net projects and, compared to JDBC, the lack of support for Oracle object and collection types in the various data provider options, including ODP, is a frustration. While there is support for associative arrays, Oracle permanent object types (CREATE TYPE … AS OBJECT / AS TABLE OF …) are not supported… yet.

But I believe that that support is coming in the 11g version of ODP.NET. I’ll try to dig out the link.

Application Hanging

Today a problem with the application “hanging”.

The background is that there has been some development in an application against database A.
Database A is now out of action due to a disk controller failure or something akin to that, and so the application is now pointing at database B.
When pointing at database A, the application was performing acceptably, not so when pointing at database B where it is “hanging”.

So, straightaway, in response to one of the questions we should ask ourselves when there is a sudden problem – what has changed? – we know that there has been a change of database, and the immediate follow-up is: what might the differences be between database A and database B?

Fortunately, one of the differences is already known to me and, given the symptoms, is jumping up and down, waving its hands and shouting ‘Look at me’.

Both database A and database B are relatively recent copies of the production database.
All 8.1.7 by the way.
However, the production database has a bad mix of some tables having statistics (some stale) and some having none, which can lead the CBO to make some inappropriate decisions because of the difference between the default values it uses for objects without statistics and the reality of the data distribution.

Database A had had a statistics update subsequent to the refresh.
Database B had not.

Therefore it is highly likely that the cause of the application “hanging” is a less-than-optimal plan for the SQL concerned.

For the application operation that is hanging, quite a few bits of SQL might be issued.
There might be more than one bit of SQL that is performing unacceptably but, in any event, we need to work through each suboptimal performer as we find it.

The application mostly uses ORM to generate its lovely queries (although in this case the development that has occurred is to optimise some of the iterative, slow-by-slow dynamic SQL that is generated).

Therefore the best mechanism is to trace the database session concerned, replicate the “hanging” behaviour and see what it is doing. What I am expecting to find is that, on “hanging”, the last piece of SQL in the trace file will be doing more IO than it needs to.

So, a trigger is required which sets the extended trace event for a specific user:


CREATE OR REPLACE TRIGGER xxxxxx
AFTER LOGON ON DATABASE
BEGIN
   -- Level 12 captures bind variable values as well as wait events
   IF USER = 'yyyyy'
   THEN
      EXECUTE IMMEDIATE
         'ALTER SESSION SET events ''10046 trace name context forever, level 12''';
   END IF;
END;

The next step is to reproduce the “hanging” behaviour via the application.
Sure enough, this generates a trace file on the server showing many, many “db file sequential reads”.
More than normal? (where normal is the behaviour when it performs acceptably)
Don’t know for sure yet, but it’s a good bet.
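
As an aside, the raw trace is much easier to digest after a pass through tkprof – something like the following (file names are placeholders), sorting statements by elapsed fetch time so that the worst offender floats to the top:


tkprof <trace_file>.trc <trace_file>.txt sort=fchela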

The SQL in question is something like this:


SELECT ....
FROM News t0
, (SELECT DISTINCT T1.newsid
FROM nolindex T2
, IMPORTERS T3
, NewsIndex T1
WHERE T3.destindexid = :1
AND T1.sectionid = T2.indexid
AND T2.indexid = T3.sourceindexid) v1
WHERE t0.ID = v1.newsid

The trace file is incomplete because, due to all the IO, it reached its size limit. Also, the application was killed once we had the information we needed.

On database B (some stats but not all), running the SQL with a SET AUTOTRACE TRACE EXPLAIN reveals the following:


Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=216293 Card=59764261584 Bytes=196445127826608)
1 0 MERGE JOIN (Cost=216293 Card=59764261584 Bytes=196445127826608)
2 1 TABLE ACCESS (BY INDEX ROWID) OF 'NEWS' (Cost=826 Card=1515434 Bytes=4961530916)
3 2 INDEX (FULL SCAN) OF 'PKNEWSTABLE' (UNIQUE) (Cost=26 Card=1515434)
4 1 SORT (JOIN) (Cost=50339 Card=3943706 Bytes=51268178)
5 4 VIEW (Cost=165128 Card=3943706 Bytes=51268178)
6 5 SORT (UNIQUE) (Cost=165128 Card=3943706 Bytes=256340890)
7 6 NESTED LOOPS (Cost=24 Card=3943706 Bytes=256340890)
8 7 MERGE JOIN (CARTESIAN) (Cost=24 Card=2589433 Bytes=134650516)
9 8 TABLE ACCESS (BY INDEX ROWID) OF 'IMPORTERS' (Cost=2 Card=11 Bytes=286)
10 9 INDEX (RANGE SCAN) OF 'DESTINDEXID_IDX' (NON-UNIQUE) (Cost=1 Card=11)
11 8 SORT (JOIN) (Cost=22 Card=235403 Bytes=6120478)
12 11 INDEX (FAST FULL SCAN) OF 'NEWSINDEXSECTLANGNEWSINDEX' (NON-UNIQUE) (Cost=2 Card=235403 Bytes=6120478)
13 7 INDEX (UNIQUE SCAN) OF 'NOLINDEX_PK' (UNIQUE)

There are some big numbers in there.

The next step is to run against database C, which should be similar to database A before it died – i.e. a refresh from production with statistics on more objects:


Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=206 Card=89 Bytes=26700)
1 0 NESTED LOOPS (Cost=206 Card=89 Bytes=26700)
2 1 VIEW (Cost=28 Card=89 Bytes=1157)
3 2 SORT (UNIQUE) (Cost=28 Card=89 Bytes=2314)
4 3 NESTED LOOPS (Cost=26 Card=89 Bytes=2314)
5 4 NESTED LOOPS (Cost=2 Card=8 Bytes=120)
6 5 TABLE ACCESS (FULL) OF 'IMPORTERS' (Cost=2 Card=8 Bytes=80)
7 5 INDEX (UNIQUE SCAN) OF 'NOLINDEX_PK' (UNIQUE)
8 4 INDEX (RANGE SCAN) OF 'NEWSINDEXSECTLANGNEWSINDEX' (NON-UNIQUE) (Cost=3 Card=225099 Bytes=2476089)
9 1 TABLE ACCESS (BY INDEX ROWID) OF 'NEWS' (Cost=2 Card=4695495 Bytes=1347607065)
10 9 INDEX (UNIQUE SCAN) OF 'PKNEWSTABLE' (UNIQUE) (Cost=1 Card=4695495)

Some much smaller numbers and completely different access paths.

The numbers themselves are not an issue – it’s the different access paths that are the problem.
What I mean is that if, via hints for example, we forced the second access path on the poorly performing database, explain plan would still show much bigger numbers.

And so, for possible resolutions, we have a number of options, at least four, possibly more:
Option numero uno: Gather appropriate statistics on all objects so that the CBO has a decent chance.
Option numero the next one: Get rid of the partial statistics so that we have none and go back to using RBO.
Option three: Hint the SQL in the application.
Option four: Gather a stored outline for the query.

Options one and two can be discounted due to organisational “fear” of change. I would say that option one – all objects with appropriate statistics – is much preferable to option two. However, neither is a go-er due to fear that this might have a wider detrimental effect. (Incidentally, option one was attempted previously but the deployment was backed out because, according to rumour, the performance of something somewhere degraded.)
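(For reference, option one would amount to little more than a call along these lines – schema name and parameter values hypothetical:)


-- Gather statistics for all objects in the application schema,
-- including indexes (cascade), sampling 10% of the rows
EXEC DBMS_STATS.GATHER_SCHEMA_STATS(ownname => 'APPOWNER', estimate_percent => 10, cascade => TRUE);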
Option three might be ok. The downside from an application perspective is that to put the hints in requires an application deployment, an application stop and start, and to take them out requires the same. As part of a wider strategy for this particular application, its suitability is questionable because it’s ORM-generated code and therefore it’s probably not possible to control the hinting.
That leaves option four – capture a stored outline for the query on the “good” database, export, import to the “bad” database(s) and enable usage thereof. It might require some shared pool flushage, but it’s easily turn-on-and-offable and can be repeated easily for other problematic queries.

And just for completeness, if we were going for option four, the steps would be to first create the stored outline on the “good” database:


CREATE OR REPLACE OUTLINE <outline_name>
FOR CATEGORY <category_name>
ON
<sql_statement>;

(Note that the text of the SQL statement must exactly match the text that will be executed by the application.)
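
So, for the query above, it would look something like this (outline and category names invented, and the SELECT list elided as before):


CREATE OR REPLACE OUTLINE news_hang_fix
FOR CATEGORY app_fixes
ON
SELECT ....
FROM News t0
, (SELECT DISTINCT T1.newsid
   FROM nolindex T2
   , IMPORTERS T3
   , NewsIndex T1
   WHERE T3.destindexid = :1
   AND T1.sectionid = T2.indexid
   AND T2.indexid = T3.sourceindexid) v1
WHERE t0.ID = v1.newsid;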

Next, export the outline from the “good database”, e.g.


exp userid=outln/ query="where category = '<category_name>'" tables=(ol$,ol$hints)

Then, import into the “bad” database, e.g.


imp userid=outln/ full=y ignore=y

Finally, maybe use a trigger to enable the usage of the stored outline for the problem application user:


CREATE OR REPLACE TRIGGER xxxxxxxxx
AFTER LOGON ON DATABASE
BEGIN
   IF USER = 'yyyyyyyyyyyyy'
   THEN
      EXECUTE IMMEDIATE
         'ALTER SESSION SET USE_STORED_OUTLINES = <category_name>';
   END IF;
EXCEPTION
   -- Swallow any error so that a problem here cannot prevent logons
   WHEN OTHERS THEN NULL;
END;
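
And, true to the turn-on-and-offable claim, reverting is a single statement (session-level shown here; remove or amend the trigger for a permanent change):


ALTER SESSION SET USE_STORED_OUTLINES = FALSE;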

ORM

Aaaaarggghh. 

ORM – Object Relational Mapping – seems to be the bane of a lot of Oracle Developers / DBAs these days.

I’m not talking about great Oracle features such as Oracle relational views using UDTs (User-Defined Types) and operators such as CAST, TABLE and MULTISET.

More the object/relational persistence layer type things like Hibernate, iBATIS, JDO, SDO, etc.

I get the whole thing about it saving application developers from writing so much code and therefore there is a reduction in testing and errors, and the application gets written that much quicker (in theory), etc.

But often it’s like death by one thousand little generated SQL statements where one bigger one could do the job much more efficiently. Either that or you find the most horrific query you’ve ever seen has been generated, with seemingly hundreds of tables, silly aliases, and hordes of UNIONs and ORs, etc.

Maybe one of the problems has been that the DAO layer has never been particularly cool or trendy and that most application developers have never been into writing SQL – it’s always been a bit of a chore and boring to them. But SQL isn’t difficult. You’ve just got to think in sets.

And I’m sure that this is one of those scenarios where the oft-quoted 80:20 “rule” can be applied – i.e. that an ORM tool might make sense 80% of the time, particularly when SQL experts aren’t available. Trouble is that you can turn that silly rule right around and say that the 20% of code where these ORMs don’t work very well takes up 80% of the effort.

The problem for me with this is the database becomes more like just a bucket. And a bucket which is accessed by inefficient SQL. The database was born to store data and manipulate data using set operations. More often than not with ORM, we find row-by-row operations, we don’t see efficient set operations, we don’t see bulk operations, we see dynamically generated IN lists, we see lots of OR clauses, etc.
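
To make that concrete – a contrived sketch with invented table and column names – compare the typical generated pattern with its set-based equivalent:


-- Typical ORM pattern: the application loops over fetched ids
-- and issues this once per row – N executions, N round trips
UPDATE orders SET status = 'SHIPPED' WHERE order_id = :1;

-- Set-based equivalent: one statement, one pass over the data
UPDATE orders
SET status = 'SHIPPED'
WHERE batch_id = :1;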

And then, more often than not, when there’s a problem with that SQL, there’s nothing that can be done about it.

Going back to the O-R features in Oracle, these have been steadily developed since 8i. I’m a big fan of creating a layer of O-R views to provide more appropriate interfaces for the application to the database entities and have used them to great success in a variety of commercial JDBC applications. And it always comes as a surprise to the application developers that it is possible to pass Oracle UDT collections back and forward. Granted, the JDBC is a little fiddly, but it’s a good way of doing efficient set/bulk operations on entities that are a little more natural to the OO world than the base relational entities. It’s a pity that ODP.NET does not yet have the same support for these Oracle Types.
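
To give a flavour of the O-R view approach (names invented; the principle is just CAST/MULTISET over the base relational tables):


CREATE TYPE order_line_ot AS OBJECT (line_no NUMBER, product_id NUMBER, qty NUMBER);
/
CREATE TYPE order_line_nt AS TABLE OF order_line_ot;
/
-- Each order comes back as a single row with its lines nested
-- as a collection – a more natural shape for the OO world
CREATE OR REPLACE VIEW order_ov
AS
SELECT o.order_id
,      o.customer_id
,      CAST(MULTISET(SELECT order_line_ot(l.line_no, l.product_id, l.qty)
                     FROM order_lines l
                     WHERE l.order_id = o.order_id)
            AS order_line_nt) AS lines
FROM orders o;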

Maybe one day all the application developers (or 80% of them) will be replaced by code generators that work from a few boxes and a few lines put together on something like a Visio diagram. I hope not, because I consider myself an application developer/designer starting from the database going out to and including the application.

Alternatively, maybe boxes, memory and disk space will get so big and fast that these inefficiencies aren’t a concern any more (either that, or the effects of the inefficiencies are magnified).
