November 6, 2009 Leave a comment
It’s been years and years since I worked regularly with SQL*Loader-based feeds.
There are loads of tips and tricks regarding formats, encodings, character sets, etc.
I had forgotten nearly everything that I could ever have claimed to know in this area, all aged out of my personal buffer cache, at best distant memories, very distant.
However, I’ve recently had to resurrect some of these distant memories to feed some data via External Tables.
I thought it would be worthwhile to do an incoherent brain dump on some of the particular issues for future reference.
This article is my no means comprehensive and will just touch on some of the considerations specific to what I was doing.
My circumstances involved the ORACLE_LOADER driver.
As the names suggest, ORACLE_LOADER is related to SQL*LOADER. In fact, you can use the EXTERNAL_TABLE=GENERATE_ONLY parameter with SQL*LOADER to generate the syntax for your EXTERNAL TABLE.
I think things like this are brilliant but I never use them. I suppose I’m old skool. Neat features that auto generate code mean that a) I don’t necessarily need to understand what’s going on and b) skip the line-by-line sanity checks that I still believe in.
I work in an environment where it’s not so easy to get access to servers to ftp and view files, etc.
As a result, my approach would be to work with SQL*Loader early on and then graduate to external files nearer the time. However, I’m going to skip all that, fast forward past the SQL*Loader bit and move on to the issues.
So the main considerations for loading up data?
- Encoding / Character sets
- Delimiters – field and line
- File Transfer
In my experience, when you get errors whilst loading, it’s due to one or more of the reasons above.
More than one?
For example, the combination of file transfer mechanism and delimiters.
Ever wondered why there’s confusion over whether you should transfer a file as text (or ascii), as binary? What’s the difference?
As ever, it depends.
- SFTP doesn’t have a text mode, all transfers are binary.
- FTP lets you choose between ascii (default) and binary.
- What about WinSCP and similar tools with their automatic mode?
Automatic modes usually work off the file extension and would pick a text transfer for HTML, TXT, PHP, XML, etc and binary otherwise.
And the differences in mode?
For Text mode, there are two basic methods – either the tool is responsible for doing some conversion to the format supported by the destination or, more commonly, the client converts to a canonical format and the service then does a further conversion to its own format if necessary.
With Binary mode, the raw bytes are transferred as is, i.e. the file is transferred in its original form.
So, the main significance of this related to field and line delimiters – things like tabs and line feed characters are different between platforms, different between Unix and Windows for example.
This was particularly relevant for my file export from SQL Server to Oracle on Linux.
In Windows, a new line is often represented by two characters – one carriage return and one line feed.
In Unix, a new line is normally just a line feed.
Sometimes you see “^M” characters on *nix. What’s this about?
This represents the carriage return part of the newline for Windows as described above.
If you need to, you can get rid of that using dos2unix.
So, in the ACCESS PARAMETERS subsection, if you use the “RECORDS DELIMITED BY NEWLINE” syntax in your external table definiton, what does that mean?
The NEWLINE keyword uses the newline format for your platform – so just a line feed in the case of *nix.
For my process, I decided that I would just go with the format as extracted from SQL Server – a 50:50 decision that there’s no point on expanding on.
So, in my situation, the syntax for an external table should not use the NEWLINE keyword because the format was Windows new lines, the syntax for which can be:
RECORDS DELIMITED BY '\r\n'
Here’s a situation. Things have been going fine in DEV,etc and eventually go to PREPRODUCTION and there’s a problem.
The feeds don’t work. The external table loader isn’t finding the right delimiters.
The question really is how can you tell what characters are in your file?
So, there’s no point looking at the file prior to transfer in case the technological clue (or the person in a manual process) changes the file on transfer. We need to look at it in the destination directory on the destination server.
And what’s the best way to do that?
On a Unix/Linux, a useful command is the od command which dumps files in octal and other formats. For example,
od -c <filename>
which gives ascii characters or backslash escapes.
There’s a lot of output from this command but you don’t need to do the whole file, e.g.
head -2 <filename> | od -x | more
If the file suddenly doesn’t match the expected format, what’s the sort of errors you might get?
Well, if it should have transfered in binary but it was done in text by error, you might get:
KUP-04020: found record longer than buffer size supported
i.e. the lines are running into each other
or under slightly different circumstances
KUP-04023: field start is after end of record
That’s about it on transfers and delimiters.
What’s there to say about encoding and character sets.
Using the od command above we can see the hex codes to double check the encoding is as expected.
Other that that, in our external table definition we can specify the character set in the ACCESS PARAMETERS section using the CHARACTERSET keyword, e.g.
CREATE TABLE <my_tablename> (<my_columns>) ORGANIZATION EXTERNAL ( TYPE ORACLE_LOADER DEFAULT DIRECTORY <my_directory> ACCESS PARAMETERS ( RECORDS DELIMITED BY '\r\n' CHARACTERSET AL16UTF16 FIELDS TERMINATED BY '\t' MISSING FIELD VALUES ARE NULL ) LOCATION (<my_directory>:'<my_filename>') ) REJECT LIMIT 0;
and then the expected characterset conversion will take place.
Note that SQLServer tends to use USC2 as its unicode characterset and UCS2 is a subset of AL16UTF16.
It’s Friday, it’s 5 o’clock, I’ve lost focus. I’m done.