Re: Word 8 Import work

Justin Bradford (justin@ukans.edu)
Tue, 9 Mar 1999 15:19:52 -0600 (CST)


> The second approach is to use much (but not all) of the
> mswordview code as a static library, and basically model
> the top-level parsing (and HTML generation) that goes on
> from mswordview, but in our own code which would call the
> proper piece table populate functions. We'd basically just
> be merging in all the support code for mswordview (and there's
> quite a bit of it), and recreating the code in the if()
> statements to do our thing. This would be efficient, but
> would require more understanding of how mswordview does its thing.

This is the plan I had adopted, but I have not done much with it yet.
I'm currently studying mswordview, and was going to
begin modularizing it so that the HTML parts could be replaced by an
AbiWord processor.

> The third is to use Roberto Arturo Tena Sánchez's COLE
> (the successor to oledecod, used in mswordview) as the stream
> parser, and model Caolan's parsing logic to grab the juicy bits
> from the stream. This last one seems like the cleanest and most
> maintainable method, especially since COLE can evolve outside our
> work, but it's also the most effort to get going. I've
> already started hacking on a class that does this, but there's
> a lot of redundant "support" code between AbiWord and mswordview,
> so it'll take me a little while to get the top-level parser working.

Well, COLE is actually a very tiny part of a Word8 importer.
I had planned to move the modularized mswordview over to COLE, but the
oledecod it uses now is really just an earlier version of COLE.

> What would you suggest, assuming you've been digging around
> in the mswordview code a bit? Is it reasonable as a library?

The Word file format is complex, and it looks like Microsoft's
documentation is occasionally vague and sometimes just wrong.
MSWordView already does the low-level stuff for us, but the
logic for HTML output is fairly embedded. With all due respect to
Caolan, the source is hard to follow and fairly messy, but a lot of
that is just due to the complexity of the format. It would
require a significant amount of work to make it into the
library AbiWord needs. However, making a new importer as complete
as MSWordView is now would take even more time.

Probably the best option is to start with MSWordView and turn it into
a set of functions to handle all of the extraction and basic processing
(like expanding PAPX and CHPX records). This will require substantial
internal reworking, including:

1. Replacing the error and abort routines.
   It will need a flexible way to return errors, and we might
   want a way to record messages, such as nonfatal incompatibilities
   and unimplemented features.
2. Moving the many global variables MSWordView uses into
   a large doc struct of some kind.
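
To make those two points a bit more concrete, here is a minimal sketch in C
of what a per-document struct with an error path and a message log could
look like. Every name in it (wd_document, wd_status, wd_note, and so on) is
made up for illustration and does not exist in MSWordView, and the fields
are only placeholders for whatever the current globals turn out to be.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical names throughout; none of these exist in mswordview. */

typedef enum {
    WD_OK = 0,
    WD_ERR_IO,       /* could not read the stream */
    WD_ERR_FORMAT    /* file contents are not what we expected */
} wd_status;

typedef enum { WD_MSG_WARNING, WD_MSG_UNIMPLEMENTED } wd_msg_kind;

#define WD_MAX_MESSAGES 32
#define WD_MSG_LEN 128

typedef struct {
    wd_msg_kind kind;
    char text[WD_MSG_LEN];
} wd_message;

/* State that would otherwise live in globals, one instance per document. */
typedef struct {
    FILE *main_stream;       /* placeholder for a former global */
    long text_start;         /* placeholder: offset of the text */
    long text_length;        /* placeholder: length of the text */
    wd_message messages[WD_MAX_MESSAGES];
    int message_count;       /* messages actually stored */
    int messages_dropped;    /* messages lost because the log was full */
} wd_document;

void wd_init(wd_document *doc)
{
    assert(doc != NULL);
    doc->main_stream = NULL;
    doc->text_start = 0;
    doc->text_length = 0;
    doc->message_count = 0;
    doc->messages_dropped = 0;
}

/* Record a nonfatal note instead of printing it or aborting. */
void wd_note(wd_document *doc, wd_msg_kind kind, const char *fmt, ...)
{
    va_list ap;
    wd_message *m;

    assert(doc != NULL && fmt != NULL);
    if (doc->message_count >= WD_MAX_MESSAGES) {
        doc->messages_dropped++;
        return;
    }
    m = &doc->messages[doc->message_count++];
    m->kind = kind;
    va_start(ap, fmt);
    vsnprintf(m->text, sizeof m->text, fmt, ap);
    va_end(ap);
}

/* Example routine: failure goes back through the return value. */
wd_status wd_check_text_range(wd_document *doc, long stream_size)
{
    assert(doc != NULL);
    if (doc->text_start < 0 || doc->text_length < 0)
        return WD_ERR_FORMAT;
    if (doc->text_start > stream_size ||
        doc->text_length > stream_size - doc->text_start)
        return WD_ERR_FORMAT;
    if (doc->text_length == 0)
        wd_note(doc, WD_MSG_WARNING, "document contains no text");
    return WD_OK;
}
```

The caller decides what to do with a WD_ERR_* result, and can walk
doc.messages afterwards to show the user what was skipped or unsupported.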

Then AbiWord-specific functions will be needed to deal with the data
it extracts.
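
One way that split might look, purely as a sketch: the extraction code
hands paragraphs and text runs to a table of callbacks, and the AbiWord
side (or the existing HTML output) supplies the callbacks. The names here
(wd_sink, wd_char_props, wd_emit_paragraph, wd_counting_sink) are
hypothetical and are not taken from MSWordView or AbiWord; the counting
sink just stands in for a real consumer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical interface; neither mswordview nor AbiWord defines these. */
typedef struct {
    int bold;
    int italic;
} wd_char_props;

/* Callbacks the consumer supplies; any of them may be NULL. */
typedef struct {
    void (*begin_paragraph)(void *user);
    void (*text_run)(void *user, const char *text, size_t len,
                     const wd_char_props *props);
    void (*end_paragraph)(void *user);
} wd_sink;

/* Stand-in for the extraction layer: emits one paragraph with one run. */
void wd_emit_paragraph(const wd_sink *sink, void *user,
                       const char *text, size_t len,
                       const wd_char_props *props)
{
    assert(sink != NULL && text != NULL && props != NULL);
    if (sink->begin_paragraph)
        sink->begin_paragraph(user);
    if (sink->text_run)
        sink->text_run(user, text, len, props);
    if (sink->end_paragraph)
        sink->end_paragraph(user);
}

/* A trivial consumer that only counts what it receives. */
typedef struct {
    int paragraphs;
    size_t chars;
    int bold_runs;
} wd_counter;

static void count_begin(void *user)
{
    ((wd_counter *)user)->paragraphs++;
}

static void count_run(void *user, const char *text, size_t len,
                      const wd_char_props *props)
{
    wd_counter *c = (wd_counter *)user;
    (void)text;              /* the counter ignores the characters */
    c->chars += len;
    if (props->bold)
        c->bold_runs++;
}

/* No end_paragraph callback; wd_emit_paragraph skips NULL entries. */
const wd_sink wd_counting_sink = { count_begin, count_run, NULL };
```

An AbiWord importer would fill in the same table with functions that
populate the piece table, while the extraction code stays unaware of who
is consuming its output.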

Justin Bradford
justin@ukans.edu