Oracle Text & Query Templating – Query Rewrite / Relaxation

Traditional use of Oracle Text involves writing a WHERE clause using the CONTAINS operator on the relevant column and SELECTing the score using the SCORE operator, e.g.


SELECT programmename
, score(1)
FROM programme
WHERE CONTAINS( programmename , ’Weakest Link’,1) > 0
ORDER BY score(1) DESC

Query templates are features new to Oracle 10g and are an alternative to the previous query language. Rather than passing a query string, as illustrated above, you pass a structured document which contains the query string in a tagged element. Within this structured document, you can make use of additional query features such as query rewrite and query relaxation.

The query rewrite feature enables you to submit a single query that expands the original query into the rewritten versions, e.g.


SELECT programmename
, score(1)
FROM programme
AND CONTAINS ( programmename , '
<query>
<textquery lang="ENGLISH" grammar="CONTEXT">Weakest Link
<progression>
<seq><rewrite>transform((TOKENS, "{", "}", "AND"))</rewrite></seq>
<seq><rewrite>transform((TOKENS, "SYN(", ",LISTING)", "AND"))</rewrite></seq>
<seq><rewrite>transform((TOKENS, "!", "%", "AND"))</rewrite></seq>
<seq><rewrite>transform((TOKENS, "?{", "}", "AND"))</rewrite>/seq>
<seq><rewrite>transform((TOKENS, "${", "}", "AND"))</rewrite></seq>
</progression>
</textquery>
<score datatype="INTEGER" algorithm="DEFAULT"/>
</query>', 1) > 0
ORDER BY score(1) DESC

Query relaxation looks similar to query rewrite and enables the search to execute the most restrictive version of a query first, progressively relaxing the query until the required number of hits are obtained, e.g.


SELECT programmename
, score(1)
FROM programme
AND CONTAINS ( programmename , '
<query>
<textquery lang="ENGLISH" grammar="CONTEXT">Weakest Link
<progression>
<seq>SYN(Weakest Link,LISTING)</seq>
<seq>!{weakest link}%</seq>
<seq>?weakest ?link</seq>
<seq>weakest OR link
<seq>?weakest OR ?link</seq>
<seq>fuzzy({weakest link},60,100,W)</seq>
</progression> </textquery>
<score datatype="INTEGER" algorithm="DEFAULT"/>
</query>', 1) > 0
ORDER BY score(1) DESC

Usage of these features benefit a search application where a variety of searches – exact match, synonym searching, traditional fuzzy (spelling), stem search – may be required to find the best match.

Traditionally, a strategy for handling this would be to issue a number of different individual queries or one query with a number of ORs in the WHERE clause. This approach is inefficient and duplicate results may be returned.
Using query templating to relax a query as such is more efficient than re-executing a query. In addition, by using query templating rather than individual queries, duplicate results are eliminated.
The scores returned are manipulated such that if you order by score, you can be sure that all the rows specified by an earlier criteria will be returned before those specified by a later criteria.
The scoring mechanism used is slightly different. The algorithm as explained above is still applied to each search but the percentages returned are banded according to which search returned the result. In other words, if we have 4 different searches within our query template, the top search would return hits scored between 100% and 76%, the next 75%- 51%, the next 50% – 26% and the last 25% to 1%.

Scoring Algorithm Addendum
It should be noted that the above examples do not demonstrate a good usage of the default scoring algorithm based on the nature of the data indexed. The examples above are taken from an enterprise application where the programme table holds television programme titles, e.g. Eastenders, Weakest Link, etc.

The nature of the default scoring algorithm is such that a search of “Eastenders” against a one word indexed document of “Eastenders” would not return 100%. This is because the algorithm depends on the number of occurences of a searched term against the length of the document. From the docs:


To calculate a relevance score for a returned document in a word query, Oracle Text uses an inverse frequency algorithm based on Salton's formula.

Inverse frequency scoring assumes that frequently occurring terms in a document set are noise terms, and so these terms are scored lower. For a document to score high, the query term must occur frequently in the document but infrequently in the document set as a whole.

The following table illustrates Oracle Text's inverse frequency scoring. The first column shows the number of documents in the document set, and the second column shows the number of terms in the document necessary to score 100:

Number of Documents in Document Set Occurences of Term in Document Needed To Score 100
1 34
5 20
10 17
50 13
100 12
500 10
1000 9
10,000 7
100,000 5
1,000,000 4

About these ads

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 70 other followers

%d bloggers like this: