Re: Word 8 Import work

Caolan McNamara (Caolan.McNamara@ul.ie)
Wed, 10 Mar 1999 11:30:08 -0000 (GMT)


On 09-Mar-99 Justin Bradford wrote:
>
>Well, COLE is actually a very tiny part of a Word8 importer.
>I had planned to move the modular mswordview to that, but the
>oledecod it uses now is really just an earlier version of COLE.
>
>> What would you suggest, assuming you've been digging around
>> in the mswordview code a bit? Is it reasonable as a library?
>

One of the reasons that the current version uses this older version
is that a hell of a lot of the Word docs that get sent through
mswordview are corrupt. I assume that because they don't work
under Windows, people give mswordview a try. To this end there's
some extra error checking in my version of the filters
group OLE thing; I just haven't gotten around to sending the fixes
in to COLE. I have the advantage over them of having, at this stage,
500 MB of uploaded Word files that I can automatically check
for COLE crashes against.
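
As a rough illustration of that kind of automatic check, here is a
minimal sketch in C. It assumes a POSIX system, and every name in it
(run_case, run_result) is made up for the example; it is not the
script actually used against the uploaded files. It runs one command
in a child process and reports whether the child was killed by a
signal, which is what a crash looks like to the parent.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical result codes for one test run. */
typedef enum { RUN_OK, RUN_FAILED, RUN_CRASHED, RUN_ERROR } run_result;

/* Run argv[0] with the given NULL-terminated arguments in a child
 * process and classify how it ended. */
static run_result run_case(char *const argv[])
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return RUN_ERROR;            /* could not create a child */

    if (pid == 0) {
        execvp(argv[0], argv);       /* only returns on failure */
        _exit(127);
    }

    if (waitpid(pid, &status, 0) < 0)
        return RUN_ERROR;

    if (WIFSIGNALED(status))
        return RUN_CRASHED;          /* e.g. SIGSEGV or SIGABRT */
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        return RUN_OK;
    return RUN_FAILED;               /* clean but nonzero exit */
}
```

Calling run_case once per file in the collection, with the converter
as argv[0] and the file as its argument, and logging every
RUN_CRASHED would give a list of documents worth a closer look.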

>The Word file format is complex and it looks like Microsoft's
>documentation is occasionally vague and sometimes just wrong.
>MSWordView already does the low-level stuff for us, but the
>logic for HTML output is fairly embedded. With all due respect to
>Caolan, the source is hard to follow and fairly messy, but a lot of
>that is just due to the complexity of the format. It would
>require a significant amount of work to make it into the
>library Abiword needs. However, making a new importer as complete
>as MSWordView is now would take even more time.

Yep, it's all over the shop :-), really nasty, messy, organic stuff. But
in truth the docs are definitely often vague, and they are actually
wrong or missing important notes here and there. Lists are one area
where the docs and reality often part ways. And tables in complex
mode are just plain ridiculous.

>
>Probably the best option is to start with MSWordView to make
>a set of functions to handle all of the extraction and basic processing
>(like expanding PAPX and CHPX records). This will require substantial
>internal reworking, including:
>1. Replacing the error and abort routines
> It will need a flexible way to return errors, and we might
> want a way to record messages, such as nonfatal incompatibilities
> and unimplemented features...
>2. MSWordView uses a lot of global variables which will need to be
> worked into a large doc struct of some kind.

Yes, a set of functions that just return the basic structures of the
Word format is what is required. Of course I didn't do that, as the
first order of the game was to get something functional together, and
then I just added one feature after another.
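
To make the two points quoted above a bit more concrete, here is a
minimal sketch in C of what that could look like: one struct per
document in place of globals, status codes in place of abort calls,
and a small log for nonfatal messages. All of the names (wd_doc,
wd_status, wd_note, wd_read_u16) are invented for illustration and
are not taken from mswordview or COLE. The only format fact relied
on is that Word files store integers in little-endian order.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical status codes returned instead of aborting. */
typedef enum {
    WD_OK = 0,
    WD_ERR_CORRUPT,
    WD_ERR_UNSUPPORTED
} wd_status;

#define WD_MAX_MESSAGES 16
#define WD_MESSAGE_LEN  128

/* One struct per open document, standing in for file-scope globals. */
typedef struct {
    const unsigned char *data;   /* raw bytes of a stream */
    size_t len;                  /* number of bytes in data */
    int message_count;           /* messages recorded so far */
    char messages[WD_MAX_MESSAGES][WD_MESSAGE_LEN];
} wd_doc;

static void wd_doc_init(wd_doc *doc, const unsigned char *data, size_t len)
{
    memset(doc, 0, sizeof *doc);
    doc->data = data;
    doc->len = len;
}

/* Record a message for the caller; silently drops it if the log is full. */
static void wd_note(wd_doc *doc, const char *fmt, ...)
{
    va_list ap;

    if (doc->message_count >= WD_MAX_MESSAGES)
        return;
    va_start(ap, fmt);
    vsnprintf(doc->messages[doc->message_count], WD_MESSAGE_LEN, fmt, ap);
    va_end(ap);
    doc->message_count++;
}

/* Read a little-endian 16-bit value, checking bounds so that a
 * truncated or corrupt file yields an error rather than a crash. */
static wd_status wd_read_u16(wd_doc *doc, size_t offset, unsigned *out)
{
    if (offset > doc->len || doc->len - offset < 2) {
        wd_note(doc, "2-byte read at offset %lu runs past end of data",
                (unsigned long)offset);
        return WD_ERR_CORRUPT;
    }
    *out = doc->data[offset] | ((unsigned)doc->data[offset + 1] << 8);
    return WD_OK;
}
```

A caller such as an AbiWord importer would check each return value
and, at the end, walk doc.messages to show the user whatever was
skipped or looked suspicious.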

>Then Abiword specific functions will be needed to deal with the data
>it extracts.

Most of the globals and the convoluted logic are related to dealing with
the data once it's extracted. If the API were created, then that logic
could handily be separated out and become clearer. Personally I'm working
on WMF and EMF extraction and conversion from Word docs at the moment,
which I'm doing in a more sensible library-based fashion.

C.

>
>Justin Bradford
>justin@ukans.edu

Real Life: Caolan McNamara * Doing: MSc in HCI
Work: Caolan.McNamara@ul.ie * Phone: +353-61-202699
URL: http://www.csn.ul.ie/~caolan * Sig: an oblique strategy
Cut a vital connection


