wv and abiword

Caolan McNamara (Caolan.McNamara@ul.ie)
Thu, 15 Jul 1999 15:32:07 +0100 (IST)


OK, I've checked the latest wv into CVS. What I have at this stage is:

1) solid extraction of text from Word 97, Word 6 and, I believe, Word 7
(Word 95)
2) support for both fast-saved and full-saved files in the above formats
3) paragraph property extraction for all three formats, though we may
find that Word 95 needs more work, as basically I've treated it the same
as Word 6 for the moment

4) in the abi subdir of wv I have put a very simple AbiWord import
filter which I have been using myself. It only extracts the text and
implements the paragraph alignment tag. Someone who knows what AbiWord
can actually do at the moment can take up the task of integrating the
filter into AbiWord and filling in the rest of the paragraph properties
that can be matched from MS Word to AbiWord. You'll need the specification
of the PAP (paragraph properties) as found in Word 97; I have transparently
upgraded the Word 95 and Word 6 PAPs and so on into Word 97 ones.

5) the wv interface to the filter is loosely based on the way the expat filter
does things, except that you get a charData callback for each individual
character in the Word file rather than efficient chunks. Slowish, I suppose, but
easy and stable for the moment. You'll note that there is a chartype field
in the character data handler. This is set to 1 if the char is an 8-bit Windows
cp1252 character (or whatever the bastardized version of ISO-8859-1/15 is
called) and to 0 if it's a bastardized Windows 16-bit Unicode character.
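
As a rough illustration of how a filter might consume the per-character
callback described in 5), here is a minimal sketch in C. The names
char_data, text_buf, CHAR_8BIT and CHAR_UNICODE16 are made up for this
example and are not wv's actual interface; only the meaning of the chartype
values (1 for 8-bit, 0 for 16-bit) comes from the description above. It
assumes the 8-bit characters are cp1252 and converts everything to UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout: this is NOT wv's real API. */
enum { CHAR_UNICODE16 = 0, CHAR_8BIT = 1 };

struct text_buf {
    char data[256];
    size_t len;
};

/* Unicode values for cp1252 bytes 0x80..0x9F; 0 marks an undefined byte. */
static const unsigned short cp1252_high[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/* Append one code point (BMP only) to the buffer as UTF-8. */
static void append_utf8(struct text_buf *buf, unsigned int cp)
{
    if (buf->len + 4 >= sizeof buf->data)
        return; /* sketch: silently drop input once the buffer is full */
    if (cp < 0x80) {
        buf->data[buf->len++] = (char)cp;
    } else if (cp < 0x800) {
        buf->data[buf->len++] = (char)(0xC0 | (cp >> 6));
        buf->data[buf->len++] = (char)(0x80 | (cp & 0x3F));
    } else {
        buf->data[buf->len++] = (char)(0xE0 | (cp >> 12));
        buf->data[buf->len++] = (char)(0x80 | ((cp >> 6) & 0x3F));
        buf->data[buf->len++] = (char)(0x80 | (cp & 0x3F));
    }
    buf->data[buf->len] = '\0';
}

/* Called once per character; chartype says how to interpret ch. */
int char_data(void *user, unsigned short ch, int chartype)
{
    struct text_buf *buf = user;
    unsigned int cp = ch;

    assert(buf != NULL);
    if (chartype == CHAR_8BIT) {
        cp &= 0xFF;
        if (cp >= 0x80 && cp <= 0x9F) {
            cp = cp1252_high[cp - 0x80];
            if (cp == 0)
                cp = 0xFFFD; /* undefined cp1252 byte */
        }
    } else if (cp >= 0xD800 && cp <= 0xDFFF) {
        cp = 0xFFFD; /* surrogate pairs are not handled in this sketch */
    }
    append_utf8(buf, cp);
    return 0;
}
```

Feeding it the byte 0x80 with chartype 1 and the value 0x20AC with chartype 0
should give the same euro sign both times, which is the whole point of
carrying the chartype flag alongside each character.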

There is still a lot of work to be done, and many special cases to handle,
including char properties. But for the moment, with a bit of luck, you
should be able to extract text and basic para properties from the majority of
Word documents.
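
For whoever picks up the paragraph properties, the alignment case might look
something like the sketch below. The helper name jc_to_text_align is
hypothetical and is not part of wv or AbiWord. It assumes the Word 97 PAP
justification code (jc) uses 0 for left, 1 for centred, 2 for right and 3 for
justified, and that the AbiWord side wants a CSS-like text-align value; check
both against the respective specifications before relying on it.

```c
#include <stddef.h>

/* Hypothetical helper: map a Word 97 PAP justification code to a
 * text-align value. The code values are an assumption, see above. */
const char *jc_to_text_align(int jc)
{
    switch (jc) {
    case 0:  return "left";
    case 1:  return "center";
    case 2:  return "right";
    case 3:  return "justify";
    default: return NULL; /* other codes are left unmapped in this sketch */
    }
}
```

Returning NULL for anything unrecognised lets the caller fall back to
AbiWord's default alignment rather than guessing.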

I'm after finishing it off in a bit of a blur, so a few baddies might have snuck
in. I'm off to do something else in a hurry.

Enjoy.

C.

Real Life: Caolan McNamara * Doing: MSc in HCI
Work: Caolan.McNamara@ul.ie * Phone: +353-86-8790257
URL: http://www.csn.ul.ie/~caolan * Sig: an oblique strategy
Are there sections?


