How about changing
    wvSetCharHandler(myCharProc);
so that it becomes either
    wvSetCharHandler(myCharProc,NATIVE);
or
    wvSetCharHandler(myCharProc,UNICODE);
so that the charhandler is then passed either the
original characters or the unicode characters,
depending on how the charhandler was set?
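A rough sketch of how a mode-aware registration could behave. Every name
here (sketchSetCharHandler, SKETCH_NATIVE, SKETCH_UNICODE, sketchEmitChar,
sketchRecordChar) is hypothetical and is not the real wv API, and the
conversion step is a placeholder that only knows about one cp1252 byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; none of this is the real wv API. */
typedef enum { SKETCH_NATIVE, SKETCH_UNICODE } SketchCharMode;
typedef int (*SketchCharProc)(void *ctx, uint16_t ch);

static SketchCharProc g_proc = NULL;
static SketchCharMode g_mode = SKETCH_NATIVE;

/* Placeholder conversion: only 0x80 (the euro sign in cp1252) is mapped.
 * A real version would consult a full mapping table. */
static uint16_t sketchToUnicode(uint8_t native)
{
    return native == 0x80 ? 0x20AC : native;
}

/* Register the handler together with the form of character it wants. */
void sketchSetCharHandler(SketchCharProc proc, SketchCharMode mode)
{
    assert(proc != NULL);
    g_proc = proc;
    g_mode = mode;
}

/* Called by the (imaginary) parser for each 8-bit character it reads. */
int sketchEmitChar(void *ctx, uint8_t native)
{
    uint16_t ch;

    assert(g_proc != NULL);
    ch = (g_mode == SKETCH_UNICODE) ? sketchToUnicode(native) : native;
    return g_proc(ctx, ch);
}

/* Example handler: remembers the last value it was given. */
int sketchRecordChar(void *ctx, uint16_t ch)
{
    *(uint16_t *)ctx = ch;
    return 0;
}
```

The point of the design is that the handler itself never needs to know
which codepage the document used; it only states once, at registration
time, which form it wants to receive.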
On unicode conversion tables etc, there are some features in glibc2 that
can supposedly do some of this conversion already. I do not know
what tables are included, and I can't remember the function names for
certain, but I think the interface is called iconv. The tables are not
physically large, but of course if every app starts storing them
independently then there will be a large quantity of bloat.
The optimum solution would be for the gnu libc to have a working iconv
which can handle all of the windows codepage to unicode conversions, so
that we can test for the required level of support at compile time and
install our own mapping tables only when they are needed.
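As a sketch of what using iconv for this might look like, the function
below converts a single cp1252 byte and reports failure if the local
iconv has no suitable converter. The function name is made up for
illustration, and the encoding names "CP1252" and "UTF-16LE" are the ones
glibc accepts; other iconv implementations may spell them differently.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/*
 * Convert one cp1252 byte to a Unicode code point using iconv.
 * Returns the code point, or -1 if this iconv has no CP1252 converter
 * or the byte cannot be converted.
 */
long cp1252_to_unicode_iconv(unsigned char byte)
{
    char in[1];
    unsigned char out[4] = { 0 };
    char *inp = in;
    char *outp = (char *)out;
    size_t inleft = sizeof in;
    size_t outleft = sizeof out;
    size_t rc;
    iconv_t cd;

    cd = iconv_open("UTF-16LE", "CP1252");   /* (to, from) */
    if (cd == (iconv_t)-1)
        return -1;                           /* converter not available */

    in[0] = (char)byte;
    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);

    /* Expect exactly one 16-bit unit of output for one input byte. */
    if (rc == (size_t)-1 || outleft != sizeof out - 2)
        return -1;
    assert(inleft == 0);

    /* Reassemble the little-endian 16-bit value. */
    return (long)out[0] | ((long)out[1] << 8);
}
```

A configure-time probe could compile and run something along these lines
and fall back to bundled mapping tables when it returns -1.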
Currently I convert the mapping tables into C code and use that, because
I was only looking at the cp1252 codepage and not worrying about the
other ones, seeing as in word97 the vast majority of non-ascii
documents are already in unicode.
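For illustration, a cp1252 table compiled into C can be quite small,
because cp1252 only differs from the first 256 unicode code points in the
range 0x80 to 0x9F. This is a minimal sketch and not the table wv actually
ships; the names are hypothetical, and mapping the five unassigned bytes
to U+FFFD is an assumption made here (other choices are possible).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* cp1252 bytes 0x80..0x9F; 0xFFFD marks the bytes cp1252 leaves unassigned
 * (0x81, 0x8D, 0x8F, 0x90, 0x9D). That choice is an assumption. */
static const uint16_t cp1252_high[32] = {
    0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
    0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178
};

/* Map one cp1252 byte to a 16-bit unicode value. */
uint16_t cp1252_to_unicode(uint8_t c)
{
    if (c >= 0x80 && c <= 0x9F) {
        size_t idx = (size_t)(c - 0x80);
        assert(idx < sizeof cp1252_high / sizeof cp1252_high[0]);
        return cp1252_high[idx];
    }
    /* 0x00..0x7F and 0xA0..0xFF coincide with unicode. */
    return c;
}
```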
I must have a look around at the accepted ways of handling unicode
mapping tables: maybe a ready-built table parser, the level of iconv
support in various OSes, and maybe some configure tests for it.
The other small issue with the mapping tables and word docs is that
word stores characters in the "Symbol" and "Wingdings" fonts, among
others, as "special" characters. These are passed through the
special char handler. It is possible to convert the symbol font
into unicode through a mapping table, and this is what I do myself
when I receive a special symbol character out of the word stream, but
there is no unicode mapping table for wingdings. I have put
together a table which can convert about 100 of the wingdings glyphs
into valid unicode characters, but the rest do not exist in unicode, so
the wingdings font is always going to be a problem for a conversion.
Admittedly it's not a serious issue, just something that niggles at me.
One thought I have is that if the charhandler is registered to always
receive unicode, I would like to short circuit the "special character"
handler: upshift symbol font characters to unicode and run them through
the ordinary character handler, do the same for the wingdings glyphs that
have a unicode match, and fall back to the special char handler for the
unmatched cases. Though this might be overly complex, and it might be
better to leave things as they are in the context of "special characters".
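The routing described above could be sketched roughly as follows. All
names are hypothetical, and the two tables hold only a few illustrative
entries (symbol 0x61 and 0x62 as alpha and beta, wingdings 0x4A as a
smiling face); real tables would be much longer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types; not part of the real wv API. */
typedef enum { FONT_SYMBOL, FONT_WINGDINGS, FONT_OTHER } SketchFont;
typedef enum { VIA_CHAR_HANDLER, VIA_SPECIAL_HANDLER } SketchRoute;
typedef struct { uint8_t glyph; uint16_t unicode; } GlyphMap;

/* Illustrative entries only. */
static const GlyphMap symbol_map[] = {
    { 0x61, 0x03B1 },   /* alpha */
    { 0x62, 0x03B2 },   /* beta */
};
static const GlyphMap wingdings_map[] = {
    { 0x4A, 0x263A },   /* smiling face */
};

#define COUNT(a) (sizeof (a) / sizeof (a)[0])

/* Linear search; returns 1 and fills *out on a match, else 0. */
static int lookup(const GlyphMap *map, size_t n, uint8_t glyph, uint16_t *out)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (map[i].glyph == glyph) {
            *out = map[i].unicode;
            return 1;
        }
    }
    return 0;
}

/*
 * Decide where a "special" character should go. In unicode mode, glyphs
 * with a known mapping are upshifted and routed to the ordinary character
 * handler; everything else stays with the special char handler.
 */
SketchRoute route_special_char(int unicode_mode, SketchFont font,
                               uint8_t glyph, uint16_t *unicode_out)
{
    assert(unicode_out != NULL);
    if (unicode_mode) {
        if (font == FONT_SYMBOL &&
            lookup(symbol_map, COUNT(symbol_map), glyph, unicode_out))
            return VIA_CHAR_HANDLER;
        if (font == FONT_WINGDINGS &&
            lookup(wingdings_map, COUNT(wingdings_map), glyph, unicode_out))
            return VIA_CHAR_HANDLER;
    }
    return VIA_SPECIAL_HANDLER;
}
```

In native mode nothing changes, so existing special char handlers would
keep working as they do now.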
C.
Real Life: Caolan McNamara * Doing: MSc in HCI
Work: Caolan.McNamara@ul.ie * Phone: +353-86-8790257
URL: http://www.csn.ul.ie/~caolan * Sig: an oblique strategy
Give way to your worst impulse