Doing globalisation the hard way?
January 24, 2007
I’m currently working with a database that supports global content. The majority of content is in English but the DB contains articles published in other Western European languages, in Cyrillic, Chinese Traditional, Chinese Simplified and Arabic.
This is an Oracle 8i database set up with a UTF8 characterset, which is a good start.
Unfortunately, from a purely database point of view, that is really the only good news, because for all non-Western European languages the data is stored as gobbledygook.
As is so often the case with a mature application, the Whys and Wherefores have been lost, and there is only a tenuous grip on the Hows. What is clear is that on getting the data out of the database, the application interprets it using Windows code page 1252. If the data is meant to be in one of the non-Western European languages, a further conversion is applied – CP1256 for Arabic, GB2312 for Chinese Simplified, Big5 for Chinese Traditional, and Windows-1251 for Cyrillic.
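A minimal sketch of the round trip described above, assuming the Arabic case and using Python's codec names for the code pages involved (the sample string and variable names are mine, purely for illustration):

```python
# Hypothetical illustration of the double-encoding scheme described above.
arabic = "مرحبا"  # an Arabic sample word

# The application encodes the text with the language-specific code page...
raw_bytes = arabic.encode("cp1256")

# ...and those bytes are then interpreted as Windows-1252 text before being
# stored, so the database sees Western European "gobbledygook".
stored = raw_bytes.decode("cp1252")
print(stored)  # looks like a run of accented Latin characters

# Reversing the two steps recovers the original text -- but only if you
# know which code page the application used for this particular row.
recovered = stored.encode("cp1252").decode("cp1256")
assert recovered == arabic
```

The key point the sketch makes is that the mapping is reversible, but only by the application: nothing stored in the row tells the database which second code page to unwind.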
The long and the short of it is that if you look at the foreign-language data in the database and inspect it using the DUMP function, it's all Western European. From a pure DB perspective, it's garbage – bits and bobs that only mean something to the application. Perhaps you could even say that, to a certain extent, this is one of those cases where the database is being used as a bit bucket.
So, the only way that you can make sense of a large proportion of the content is through the application.
Is this what is known as logically corrupt data?