wv: windows codepages, changes to allow conversion to unicode

Caolan McNamara (Caolan.McNamara@ul.ie)
Tue, 09 Nov 1999 15:48:27 -0000 (GMT)


It has come to my attention that many Word documents use 8 bit
characters together with the Windows codepages. To support this kind
of stuff I intend to add another argument to the charhandler function,
named "lid". The lid will be MS Word's language identifier for the
character involved.

At all times you can use the lid to determine what language the text
is in. This might be an idea for abiword's spell checking: part of a
document could be in "english" and the rest in "german", with a
different spell checker for each region.
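
As a rough illustration of the spell checker idea, a caller could map the
lid to a language tag and pick a dictionary from that. This is only a
sketch: sketch_lid_to_language is a made-up name, not part of wv or
abiword, and only three well known lid values are listed.

```c
#include <stdint.h>

/* Hypothetical helper, not part of wv: turn a Word language identifier
   into a language tag that a spell checker could use to pick a
   dictionary. Only a few well known lid values are covered here. */
const char *sketch_lid_to_language(uint16_t lid)
{
    switch (lid) {
    case 0x0409: return "en-US";   /* English (United States) */
    case 0x0809: return "en-GB";   /* English (United Kingdom) */
    case 0x0407: return "de-DE";   /* German (Germany) */
    default:     return "unknown"; /* caller falls back to a default */
    }
}
```

A run of text would then be checked against the dictionary for its own
lid, and the checker would switch whenever the lid changes.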

Anyway, back to what I am doing. If the character is a 16 bit unicode
character there is no problem. If the character is 8 bit, then in my
code I will take the 8 bit character, figure out which windows
codepage is being used for it, and convert the character to unicode
through one of the available mapping tables. Then we can handle the
resulting unicode character as normal.
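
To make the mapping table step concrete, here is a minimal sketch for one
codepage, Windows-1252. The function name cp1252_to_unicode is an
assumption for illustration and not the actual wv code. Bytes below 0x80
and from 0xA0 up map to the same value in unicode, so only 0x80 to 0x9F
need a table. Returning U+FFFD for the bytes that cp1252 leaves undefined
is a choice made for this sketch.

```c
#include <stdint.h>

/* Unicode values for Windows-1252 bytes 0x80..0x9F. Bytes that the
   codepage leaves undefined are mapped to U+FFFD (replacement
   character), which is a choice made for this sketch. */
static const uint16_t cp1252_high[32] = {
    0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
    0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178
};

/* Hypothetical helper, not part of wv: convert one 8 bit Windows-1252
   character to a 16 bit unicode character. */
uint16_t cp1252_to_unicode(uint8_t ch)
{
    if (ch >= 0x80 && ch <= 0x9F)
        return cp1252_high[ch - 0x80];
    return ch; /* ASCII and 0xA0..0xFF match unicode directly */
}
```

For example, the byte 0x93 (a left curly quote in cp1252) comes out as
U+201C, while 0xE9 (e acute) stays 0x00E9.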

Now abiword uses the charhandler, so it will have to add another argument
to match the wv definition. After that, abiword will have to decide what
it wants to do when it is given an 8 bit character and a lid. The character
will still be given to the charhandler exactly as it was found in the word
document, but there will be utility functions available to convert the
character to unicode, something that works like this maybe:

(function pointer) = wvGetCodePageConverter(lid)
unicodechar = (function pointer)(8 bit codepage character);
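
Spelled out in C, the lookup could take roughly the following shape. This
is a sketch under assumptions, not the final wv interface: the names
sketch_get_converter and codepage_converter are made up, the lid to
codepage pairing covers only three lids, and both converters are
deliberately incomplete (the full 0x80 to 0x9F and 0x80 to 0xBF tables are
left out and give U+FFFD instead).

```c
#include <stddef.h>
#include <stdint.h>

/* A converter takes one 8 bit codepage character and returns the
   16 bit unicode character for it. */
typedef uint16_t (*codepage_converter)(uint8_t);

/* Simplified Western (cp1252) converter: the 0x80..0x9F table is
   omitted in this sketch, so those bytes come back as U+FFFD. */
static uint16_t western_to_unicode(uint8_t ch)
{
    if (ch >= 0x80 && ch <= 0x9F)
        return 0xFFFD;
    return ch;
}

/* Simplified Cyrillic (cp1251) converter: 0xC0..0xFF are the letters
   U+0410..U+044F; the 0x80..0xBF table is omitted in this sketch. */
static uint16_t cyrillic_to_unicode(uint8_t ch)
{
    if (ch < 0x80)
        return ch;
    if (ch >= 0xC0)
        return (uint16_t)(0x0410 + (ch - 0xC0));
    return 0xFFFD;
}

/* Hypothetical lookup: pick a converter from the lid, or NULL when
   the lid is not known to this sketch. */
codepage_converter sketch_get_converter(uint16_t lid)
{
    switch (lid) {
    case 0x0409: /* English (United States) */
    case 0x0407: /* German (Germany) */
        return western_to_unicode;
    case 0x0419: /* Russian */
        return cyrillic_to_unicode;
    default:
        return NULL;
    }
}
```

A charhandler would call the lookup with the lid it was given, check the
result for NULL, and then pass the 8 bit character through the returned
function before handling it as unicode.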

C.

This change will be in my next cvs commit, tomorrow or later on, so
I'm afraid that there will probably be some failed builds because of it.

Real Life: Caolan McNamara * Doing: MSc in HCI
Work: Caolan.McNamara@ul.ie * Phone: +353-86-8790257
URL: http://www.csn.ul.ie/~caolan * Sig: an oblique strategy
Work at a different speed


