Loader

It’s been years and years since I worked regularly with SQL*Loader-based feeds.

There are loads of tips and tricks regarding formats, encodings, character sets, etc.
I had forgotten nearly everything that I could ever have claimed to know in this area, all aged out of my personal buffer cache, at best distant memories, very distant.

However, I’ve recently had to resurrect some of these distant memories to feed some data via External Tables.

I thought it would be worthwhile to do an incoherent brain dump on some of the particular issues for future reference.

This article is my no means comprehensive and will just touch on some of the considerations specific to what I was doing.

The starting place for documentation on SQL*Loader is the Oracle Database Utilies Manual and similarly for External Tables.

The thing about external tables is that they can use the ORACLE_LOADER or the ORACLE_DATAPUMP access drivers.

My circumstances involved the ORACLE_LOADER driver.

As the names suggest, ORACLE_LOADER is related to SQL*LOADER. In fact, you can use the EXTERNAL_TABLE=GENERATE_ONLY parameter with SQL*LOADER to generate the syntax for your EXTERNAL TABLE.

I think things like this are brilliant but I never use them. I suppose I’m old skool. Neat features that auto generate code mean that a) I don’t necessarily need to understand what’s going on and b) skip the line-by-line sanity checks that I still believe in.

I work in an environment where it’s not so easy to get access to servers to ftp and view files, etc.

As a result, my approach would be to work with SQL*Loader early on and then graduate to external files nearer the time. However, I’m going to skip all that, fast forward past the SQL*Loader bit and move on to the issues.

So the main considerations for loading up data?
– Encoding / Character sets
– Delimiters – field and line
– File Transfer

(I’m going to ignore Endianness for the sake of “clarity”. But it can be a factor. See Byte Ordering for futher information.)

In my experience, when you get errors whilst loading, it’s due to one or more of the reasons above.
More than one?
Sure.

For example, the combination of file transfer mechanism and delimiters.
Ever wondered why there’s confusion over whether you should transfer a file as text (or ascii), as binary? What’s the difference?

As ever, it depends.
– SFTP doesn’t have a text mode, all transfers are binary.
– FTP lets you choose between ascii (default) and binary.
– What about WinSCP and similar tools with their automatic mode?

Automatic modes usually work off the file extension and would pick a text transfer for HTML, TXT, PHP, XML, etc and binary otherwise.

And the differences in mode?

For Text mode, there are two basic methods – either the tool is responsible for doing some conversion to the format supported by the destination or, more commonly, the client converts to a canonical format and the service then does a further conversion to its own format if necessary.

With Binary mode, the raw bytes are transferred as is, i.e. the file is transferred in its original form.

So, the main significance of this related to field and line delimiters – things like tabs and line feed characters are different between platforms, different between Unix and Windows for example.

This was particularly relevant for my file export from SQL Server to Oracle on Linux.

In Windows, a new line is often represented by two characters – one carriage return and one line feed.
In Unix, a new line is normally just a line feed.

Sometimes you see “^M” characters on *nix. What’s this about?
This represents the carriage return part of the newline for Windows as described above.
If you need to, you can get rid of that using dos2unix.

So, in the ACCESS PARAMETERS subsection, if you use the “RECORDS DELIMITED BY NEWLINE” syntax in your external table definiton, what does that mean?

The NEWLINE keyword uses the newline format for your platform – so just a line feed in the case of *nix.

For my process, I decided that I would just go with the format as extracted from SQL Server – a 50:50 decision that there’s no point on expanding on.

So, in my situation, the syntax for an external table should not use the NEWLINE keyword because the format was Windows new lines, the syntax for which can be:

RECORDS DELIMITED BY '\r\n'

Here’s a situation. Things have been going fine in DEV,etc and eventually go to PREPRODUCTION and there’s a problem.

The feeds don’t work. The external table loader isn’t finding the right delimiters.

The question really is how can you tell what characters are in your file?
So, there’s no point looking at the file prior to transfer in case the technological clue (or the person in a manual process) changes the file on transfer. We need to look at it in the destination directory on the destination server.

And what’s the best way to do that?

On a Unix/Linux, a useful command is the od command which dumps files in octal and other formats. For example,

od -c <filename>

which gives ascii characters or backslash escapes.
There’s a lot of output from this command but you don’t need to do the whole file, e.g.

head -2 <filename> | od -x | more

If the file suddenly doesn’t match the expected format, what’s the sort of errors you might get?
Well, if it should have transfered in binary but it was done in text by error, you might get:

KUP-04020: found record longer than buffer size supported

i.e. the lines are running into each other
or under slightly different circumstances

KUP-04023: field start is after end of record 

etc.

That’s about it on transfers and delimiters.

What’s there to say about encoding and character sets.
Using the od command above we can see the hex codes to double check the encoding is as expected.
Other that that, in our external table definition we can specify the character set in the ACCESS PARAMETERS section using the CHARACTERSET keyword, e.g.

CREATE TABLE <my_tablename>
(<my_columns>)
ORGANIZATION EXTERNAL
  (  TYPE ORACLE_LOADER
     DEFAULT DIRECTORY <my_directory>
     ACCESS PARAMETERS 
       ( RECORDS DELIMITED BY '\r\n'
       CHARACTERSET AL16UTF16
       FIELDS TERMINATED BY '\t' 
       MISSING FIELD VALUES ARE NULL  )
     LOCATION (<my_directory>:'<my_filename>')
  )
REJECT LIMIT 0;

and then the expected characterset conversion will take place.
Note that SQLServer tends to use USC2 as its unicode characterset and UCS2 is a subset of AL16UTF16.

It’s Friday, it’s 5 o’clock, I’ve lost focus. I’m done.

About these ads

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 69 other followers

%d bloggers like this: