wv 0.5.41, major charhandling changes.

Caolan McNamara (Caolan.McNamara@ul.ie)
Fri, 12 Nov 1999 12:38:02 -0000 (GMT)


Some structural changes to wv which are rather profound, namely:

Word documents often contain 8-bit characters in the Windows codepages,
cp-1251 (Russian) etc. etc. These have to be converted to something useful
in the real world. There exists a call on some Unices called iconv which
is for conversion between charsets. The glibc2 one can handle the Windows
codepages, but other platforms such as Solaris have an iconv whose
implementation is useless for this, so wv does the following at configure time:

test for the existence of iconv, and test whether it can handle the Windows
codepages that I am currently supporting. If it does (it appears that
only glibc2 fits this profile) then we use that for our Windows codepage
to Unicode conversion; if it does not, then I have a mini iconv implementation
which can only handle Windows codepage into Unicode conversion, Unicode
into utf-8, Unicode into koi8-r, Unicode into iso-8859-15 and Unicode into
tis-620 (Thai).
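
In spirit the configure check boils down to compiling and running a tiny
probe along these lines (a sketch only: the charset names "UCS-2" and
"CP1251" are my assumptions for illustration, the authoritative version is
the test in the configure script itself):

/* probe sketch: does the system iconv understand a Windows codepage?
   Exit status 0 means "usable iconv", non-zero means fall back to the
   built-in mini iconv.  Charset names here are illustrative. */
#include <iconv.h>
#include <stdlib.h>

int main(void)
{
    iconv_t cd = iconv_open("UCS-2", "CP1251");  /* Windows Cyrillic -> Unicode */
    if (cd == (iconv_t)-1)
        return EXIT_FAILURE;
    iconv_close(cd);
    return EXIT_SUCCESS;
}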

The wv library does not explicitly convert the incoming characters into
Unicode at the moment; I want to keep that separate for a while so we can work
out the bugs. So users of the wv API get their characters through the
charhandler, which is specified as follows:

int charhandler(wvParseStruct *ps,U16 eachchar,U8 chartype,U16 lid);

eachchar is the character; chartype says whether it is already Unicode (0)
or not (1) and therefore needs to be converted to Unicode; lid is
the language identifier, which always tells you the language that the
text is in and is used for the conversion to Unicode when the chartype
is 1.
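
To make that concrete, a user-side charhandler might look roughly like this
(only the prototype above comes from wv; the handler name and the
output_unicode and mychar_to_unicode helpers are invented for the sketch,
with mychar_to_unicode sketched a little further down):

#include "wv.h"   /* wvParseStruct, U8, U16 -- adjust to where wv's header lives */

void output_unicode(U16 uchar);                 /* hypothetical sink */
U16  mychar_to_unicode(U16 eachchar, U16 lid);  /* hypothetical, sketched below */

int mycharhandler(wvParseStruct *ps, U16 eachchar, U8 chartype, U16 lid)
{
    if (chartype == 0)
        output_unicode(eachchar);       /* already unicode, use it as is */
    else
        /* 8-bit char in the Windows codepage for language lid,
           convert it to unicode first */
        output_unicode(mychar_to_unicode(eachchar, lid));
    return 0;
}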

char *wvLIDToCodePageConverter(U16 lid)

will return the name of the codepage that is associated with a language.

The mini iconv implementation in wv has the same API as the standard one;
read the iconv manpage to see how it works, or read text.c at
wvHandleCodePage to see how I use iconv.

So wv API users currently have to check for a 1 in the chartype field,
then look up the codepage name with wvLIDToCodePageConverter, and then
create and use an iconv handle to convert the charset into Unicode
(or whatever you desire).
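
In code, that lookup-then-convert step might look roughly like the following
(a sketch only: the "UCS-2" target name, the byte order of the result and
the omission of error checking are all shortcuts of mine, and in real code
the iconv handle should be cached rather than opened per character, for the
reason given just below):

/* sketch of the hypothetical mychar_to_unicode() from the earlier
   charhandler example: convert one 8-bit char (chartype == 1) */
#include <iconv.h>
#include "wv.h"   /* U8, U16, wvLIDToCodePageConverter */

U16 mychar_to_unicode(U16 eachchar, U16 lid)
{
    char *codepage = wvLIDToCodePageConverter(lid);
    iconv_t handle = iconv_open("UCS-2", codepage);  /* target name assumed */
    U8  in  = (U8)eachchar;
    U16 out = 0;
    char  *ibuf  = (char *)&in;
    char  *obuf  = (char *)&out;
    size_t ileft = sizeof in, oleft = sizeof out;

    iconv(handle, &ibuf, &ileft, &obuf, &oleft);
    iconv_close(handle);   /* in real code, keep the handle around instead */
    return out;            /* byte order of the UCS-2 result is platform dependent */
}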

wvHtml (which is itself just an application using the wv API) does this
process itself, albeit very slowly, and currently sets up an iconv instance
for *each* character (run away, run away!). A character scanner using
wv should really buffer the characters it gets from wv and flush the
buffer (and examine the need for a new iconv handle, as sketched below
the list) when

a) a special character is received on the other handler
b) the chartype flag changes
c) the lid changes while chartype was 1 (i.e. for 8-bit text)
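
Here is the sketch promised above of what that buffering might look like
(the names, the RUNMAX size and the global state are all inventions of this
example; a real scanner would keep the run state in its own context):

#include "wv.h"   /* wvParseStruct, U8, U16 */

#define RUNMAX 4096

static struct
{
    U16 buf[RUNMAX];
    int len;
    U8  chartype;   /* chartype of the buffered run */
    U16 lid;        /* lid of the buffered run, meaningful when chartype == 1 */
} run;

static void run_flush(void)
{
    if (run.len == 0)
        return;
    /* pick (or reuse) the iconv handle appropriate to run.chartype and
       run.lid, convert run.buf[0..run.len) in one go, emit the result */
    run.len = 0;
}

int buffering_charhandler(wvParseStruct *ps, U16 eachchar, U8 chartype, U16 lid)
{
    if (run.len == RUNMAX
        || chartype != run.chartype            /* case b) */
        || (chartype == 1 && lid != run.lid))  /* case c) */
        run_flush();

    run.chartype = chartype;
    run.lid = lid;
    run.buf[run.len++] = eachchar;
    return 0;
}

/* case a): whatever handles special characters should also call run_flush()
   before acting on them */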

My next iteration on the issue will have a charhandler which
a) always hands off Unicode characters, no matter what their original
origin
b) has an internal buffer itself, which it will hand off to wv API
users in large chunks rather than single characters.
This will be a good bit easier to use, but it doesn't exist yet, so...

All of that relates to the input of Word characters into a scanner;
this is just a little section on the output of characters from wvHtml.

wvHtml users have often disliked the default utf-8 output, and in
the past I have taken some submissions to convert the default Unicode
output to koi8-r and one or two others. Because of the change to
iconv, my own mini implementation still supports this
small subset of output charsets, *but* if you are on a system with
a suitably complete iconv (i.e. just glibc2, as found on new Red Hat,
Debian etc. Linux and BSD distributions) then you can convert the output
to just about anything (cool, eh!). Just try
wvHtml --charset WhatEverYouWant test.doc

I have updated
wv/notes/internationalization/Charsets-HOWTO
to reflect these changes.

Due to the changeover and rationalization of the charset handling I
may have lost some of the specific character to HTML equivalent
conversions that I had put in there. Things like & are correct, but
there were a few others that might have slipped through the net; let
me know of any which have reverted to nonsense "?" characters.

During the process of this rather mind-bending modification spree a
stack of new files were added and moved around the build tree, with a
few new Makefiles and dirs added, so while I believe everything will work
OK, who knows. I have tested it and verified that it works on

Red Hat 4.2 with no native iconv, using my own.
Red Hat 6.1 with a perfect iconv, using it.
Solaris something-or-other with a useless iconv, using my own.

I have yet to perfect the configure test to see whether AIX, HP-UX, SGI
etc. have working iconvs. I believe that the configure test to
find them will fail and fall back to the internal iconv support. The
modifications to the build also allow wv to be built cleanly outside of its
own dir; this might also make the SGI build work correctly as well (small
hope there, I bet).

Almost finally, I would be interested in knowing what the Windows equivalent
to iconv is, and about a configure test etc. to look for it and
map it to the semantics of iconv. Any ideas?

And finally, thanks a lot (good-humoured sarcasm) to svu@pop.convey.ru,
who proved that my faith in Word 97 always using Unicode for languages
that should use Unicode was completely misplaced, and so caused this
mushrooming of issues for me to contend with.

C.

p.s. Phew!

Real Life: Caolan McNamara * Doing: MSc in HCI
Work: Caolan.McNamara@ul.ie * Phone: +353-86-8790257
URL: http://www.csn.ul.ie/~caolan * Sig: an oblique strategy
Remove ambiguities and convert to specifics


