Re: wv: windows codepages, changes to allow conversion to unicod

Caolan McNamara (Caolan.McNamara@ul.ie)
Wed, 10 Nov 1999 09:27:10 -0000 (GMT)


On 10-Nov-99 Justin Bradford wrote:
>> >(function pointer) = wvGetCodePageConverter(lid)
>> >unicodechar = (function pointer)(8 bit codepage character);
>>
>> OK. If you and Justin agree that the API should really be this low-level,
>> that's fine by me. AFAIK, we'll never want to see the character variant in
>> the Windows codepage, but just go directly to the full unicode equivalent.
>
>While we probably will never care about anything but Unicode, it might be
>possible other apps using wv do (or will). I don't think adding this extra
>conversion step is a big deal, and it makes wv more flexible for other
>uses. Possibly, a UnicodeCharHandler (which automatically converts to
>Unicode for us) in addition to the current CharHandler would be in order
>to hide some of this from the application. I imagine most apps would
>prefer straight Unicode, anyway. How do you feel about that Caolan? I can
>add it, if you think it's a good idea.
>
>As for the conversion tables, I think it's a great idea. I'm currently
>mapping a few Windows codepage oddities in the importer code, and there's
>no way that was ever going to become an extensible, non-kludge solution to
>the problem.

How about changing
wvSetCharHandler(myCharProc);
so that it becomes either
wvSetCharHandler(myCharProc,NATIVE);
or
wvSetCharHandler(myCharProc,UNICODE);

so that the charhandler is then passed either the
original characters or the unicode characters,
depending on how the charhandler was set?
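
A rough sketch of how such a mode flag could be carried through to the
dispatch step. Every name here (the demo_ prefix, the parser struct, the
converter hook) is hypothetical and only illustrates the idea; it is not
the real wv declaration, and the sample converter only knows the euro sign.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types, named demo_* to avoid suggesting they are wv's own. */
typedef enum { DEMO_NATIVE, DEMO_UNICODE } demo_char_mode;
typedef int (*demo_char_proc)(void *ctx, unsigned short ch);

typedef struct {
    demo_char_proc handler;                         /* application callback */
    demo_char_mode mode;                            /* what the callback wants */
    void *ctx;                                      /* passed back to the callback */
    unsigned short (*to_unicode)(unsigned char ch); /* codepage converter */
} demo_parser;

/* Register the handler together with the character form it expects. */
void demo_set_char_handler(demo_parser *ps, demo_char_proc proc,
                           demo_char_mode mode)
{
    assert(ps != NULL);
    ps->handler = proc;
    ps->mode = mode;
}

/* Called for each 8 bit character read from the document stream. */
int demo_dispatch_char(demo_parser *ps, unsigned char raw)
{
    unsigned short ch = raw;

    assert(ps != NULL);
    if (ps->handler == NULL)
        return 0;
    /* Convert only when the application asked for unicode. */
    if (ps->mode == DEMO_UNICODE && ps->to_unicode != NULL)
        ch = ps->to_unicode(raw);
    return ps->handler(ps->ctx, ch);
}

/* Sample converter for illustration: maps only cp1252 0x80 (euro sign). */
unsigned short demo_euro_only(unsigned char ch)
{
    return ch == 0x80 ? 0x20AC : ch;
}

/* Sample handler: remembers the last character it was given. */
int demo_record(void *ctx, unsigned short ch)
{
    *(unsigned short *)ctx = ch;
    return 0;
}
```

With this shape an application that never cares about codepages registers
once with DEMO_UNICODE and never sees an 8 bit character.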

On unicode conversion tables etc, there are some features in glibc2 that
can supposedly do some of this conversion already. I do not know what
tables are included, and I can't quite remember the name of the function,
but I think it is iconv. The tables are not physically large, but of
course if every app starts storing them independently then together they
add up to a large quantity of bloat.

The optimum solution would be that the gnu lib has a working iconv which
can handle all of the windows codepage to unicode conversion, and that
we can test for the required level of support at compile time and
install our own mapping tables only when that support is missing.
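
A minimal sketch of that arrangement, assuming a configure test defines
HAVE_ICONV when a usable iconv is found (the macro name and the helper
name are assumptions, not existing wv code). iconv_open, iconv and
iconv_close are the standard calls; whether a given libc actually ships
a "CP1252" converter is exactly what would have to be tested for.

```c
#include <assert.h>
#include <stddef.h>
#ifdef HAVE_ICONV          /* assumed to be set by a configure test */
#include <iconv.h>
#endif

/* Hypothetical helper: one cp1252 byte to one 16 bit unicode value. */
unsigned short demo_cp1252_to_unicode(unsigned char ch)
{
#ifdef HAVE_ICONV
    /* Opening a converter per character is wasteful; a real
       implementation would open it once and keep it around. */
    iconv_t cd = iconv_open("UTF-16LE", "CP1252");
    if (cd != (iconv_t)-1) {
        char in[1], out[4];
        char *inp = in, *outp = out;
        size_t inleft = 1, outleft = sizeof out;
        size_t rc;

        in[0] = (char)ch;
        rc = iconv(cd, &inp, &inleft, &outp, &outleft);
        iconv_close(cd);
        if (rc != (size_t)-1 && outleft == sizeof out - 2) {
            assert(inleft == 0);   /* success means the byte was consumed */
            return (unsigned short)((unsigned char)out[0] |
                                    ((unsigned char)out[1] << 8));
        }
    }
#endif
    /* Fallback when iconv is absent or lacks the converter: cp1252
       agrees with unicode outside 0x80..0x9F. A real build would
       consult the compiled-in mapping table for that range. */
    if (ch < 0x80 || ch >= 0xA0)
        return ch;
    return 0xFFFD;   /* U+FFFD REPLACEMENT CHARACTER */
}
```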

Currently I convert the mapping tables into C code and use that, because
I was only looking at the cp1252 codepage and not worrying about the
other ones, seeing as in word97 the vast majority of non-ascii
documents are already in unicode.
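
For illustration, a mapping table turned into C might look like the
following. The table and function names are made up for this sketch and
are not the generated wv code; the values are the cp1252 assignments for
0x80..0x9F, with the five unassigned bytes mapped to U+FFFD.

```c
#include <assert.h>
#include <stddef.h>

/* cp1252 differs from unicode only in 0x80..0x9F, so only those
   32 entries need to be stored. */
static const unsigned short demo_cp1252_high[] = {
    0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
    0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178
};

/* Hypothetical lookup: one cp1252 byte to its unicode value. */
unsigned short demo_cp1252_lookup(unsigned char ch)
{
    /* Sanity check: the table must cover exactly 0x80..0x9F. */
    assert(sizeof demo_cp1252_high / sizeof demo_cp1252_high[0] == 32);

    if (ch < 0x80 || ch > 0x9F)
        return ch;                       /* same value as in unicode */
    return demo_cp1252_high[ch - 0x80];
}
```

Other codepages would each need their own full or partial table, which
is where the bloat concern comes from.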

I must have a bit of a look around at the accepted ways of handling
unicode mapping tables, maybe a ready built table parser, level of
iconv support in various os's and some configure tests for it maybe.

The other small issue with the mapping tables and word docs is that
word stores characters in the "Symbol" and "Wingdings" fonts, among
others, as "special" characters. These are passed through the
special char handler. It is possible to convert the symbol font
into unicode through a mapping table, and this is what I do myself
when I receive a special symbol character out of the word stream, but
there is no unicode mapping table for wingdings. I have put
together a table which can convert about 100 of the wingdings glyphs
into valid unicode characters, but the rest do not exist in unicode, so
the wingdings font is always going to be a problem for a conversion.
Admittedly it's not a serious issue, just something that niggles at me.

One thought I have is that if we register the charhandler to always receive
unicode, I would like to short circuit the "special character" handler for
the symbol font: upshift the character to unicode and run it through the
ordinary character handler whenever we are in unicode always mode. I would
repeat this for the wingdings characters that have a unicode match, and
fall back to the spec char handler for the unmatched cases. Though this
might be overly complex, and it might be better to leave things as they are
in the context of "special characters".
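
The routing described above could be sketched roughly as follows. All
types and names are hypothetical, and the two tables hold only a few
illustrative entries (alpha, beta and pi for Symbol, the smiley for
Wingdings), not the full mappings.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { DEMO_FONT_SYMBOL, DEMO_FONT_WINGDINGS } demo_font;
typedef struct { unsigned char code; unsigned short ucs; } demo_map;

/* Illustrative entries only; real tables would be much longer. */
static const demo_map demo_symbol[] = {
    { 0x61, 0x03B1 }, { 0x62, 0x03B2 }, { 0x70, 0x03C0 }
};
static const demo_map demo_wingdings[] = { { 0x4A, 0x263A } };

/* Returns the unicode value, or 0 when the glyph has no match. */
static unsigned short demo_find(const demo_map *t, size_t n,
                                unsigned char code)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (t[i].code == code)
            return t[i].ucs;
    return 0;
}

typedef struct {
    int (*on_char)(void *ctx, unsigned short ch);
    int (*on_special)(void *ctx, demo_font font, unsigned char code);
    int unicode_mode;   /* nonzero: charhandler always wants unicode */
    void *ctx;
} demo_handlers;

/* Route a "special" character: upshift when possible, else fall back. */
int demo_route_special(const demo_handlers *h, demo_font font,
                       unsigned char code)
{
    assert(h != NULL && h->on_char != NULL && h->on_special != NULL);
    if (h->unicode_mode) {
        unsigned short ucs = (font == DEMO_FONT_SYMBOL)
            ? demo_find(demo_symbol,
                        sizeof demo_symbol / sizeof demo_symbol[0], code)
            : demo_find(demo_wingdings,
                        sizeof demo_wingdings / sizeof demo_wingdings[0],
                        code);
        if (ucs != 0)
            return h->on_char(h->ctx, ucs);   /* ordinary char handler */
    }
    return h->on_special(h->ctx, font, code); /* unmatched or native mode */
}

/* Sample handlers for trying it out: they just record what they saw. */
typedef struct { unsigned short last_char; int special_calls; } demo_log;

int demo_log_char(void *ctx, unsigned short ch)
{
    ((demo_log *)ctx)->last_char = ch;
    return 0;
}

int demo_log_special(void *ctx, demo_font font, unsigned char code)
{
    (void)font; (void)code;
    ((demo_log *)ctx)->special_calls++;
    return 0;
}
```

The application still has to supply a spec char handler for the
unmatched glyphs, so this saves it work but does not remove the case.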

C.

Real Life: Caolan McNamara * Doing: MSc in HCI
Work: Caolan.McNamara@ul.ie * Phone: +353-86-8790257
URL: http://www.csn.ul.ie/~caolan * Sig: an oblique strategy
Give way to your worst impulse


