POW -- Beginning the Binary Word Exporter

Subject: POW -- Beginning the Binary Word Exporter
From: Justin Bradford (justin@ukans.edu)
Date: Wed Mar 15 2000 - 17:28:24 CST

background
----------
As recent threads on the list have discussed, we need a binary Word
exporter. The current plan is to implement this functionality in the
existing wvware framework (cvs module wv).

However, there are several steps from here to a functioning exporter.
Conveniently, many are easily modularized and appropriate for work
in parallel. So this is a multi-part, multi-hacker POW.

The following steps won't get us all the way to Word export, but it's
a beginning at assembling the various pieces needed for a functioning
exporter.

scope
-----
Part 1a.
Abstract wv's current OLE stream reads

This requires no knowledge of the Word format or, for the most part,
wv functionality. We just want to improve wv's existing file support
functions to make them a little more versatile.

wv currently uses a set of functions (in wv/support.c) along the
lines of:
U16 read_16ubit(FILE*);
U32 read_32ubit(FILE*);
U16 dread_16ubit(FILE*, U8**);
U32 dread_32ubit(FILE*, U8**);
U8 dgetc(FILE*, U8**);

The normal getc(FILE*) function is also used throughout the code.

We should modify the above functions, and all of the existing
wv code that reads from OLE streams, to make use of a wvStream*
in place of a FILE*. wvStream will initially be a typedef to FILE
(i.e. 'typedef FILE wvStream').

Don't forget the wvOLE* functions, too, which is where the
abstraction begins. You get FILE*s from the old OLE code, and you'll
cast them to wvStream* here.

Also, all getc's will be replaced with a new support function.
"U8 read_8ubit(wvStream*);" seems like a logical choice. While
we're at it, I'd like dgetc renamed to dread_8ubit, too.

This abstraction lets us replace the OLE back-end transparently
to the rest of the code, which is the next step.
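To make the target of Part 1a concrete, here is a minimal sketch of what the stream-based readers might look like. The U8/U16/U32 typedefs are stand-ins for wv's own, the end-of-stream handling is a simplification, and the behaviour of the d-variant (falling back to a memory buffer when the stream is NULL) is an assumption to be checked against wv/support.c.

```c
#include <stdio.h>
#include <stdint.h>

/* Stand-ins for wv's integer typedefs (assumed widths). */
typedef uint8_t  U8;
typedef uint16_t U16;
typedef uint32_t U32;

/* Part 1a: wvStream starts life as a plain alias for FILE. */
typedef FILE wvStream;

/* Replacement for bare getc() calls. Returns 0 at end of stream
   in this sketch; real code may want better error reporting. */
U8 read_8ubit(wvStream *in)
{
    int c = getc(in);
    return (c == EOF) ? 0 : (U8)c;
}

/* Word stores multi-byte integers little-endian: low byte first. */
U16 read_16ubit(wvStream *in)
{
    U16 lo = read_8ubit(in);
    U16 hi = read_8ubit(in);
    return (U16)(lo | (hi << 8));
}

U32 read_32ubit(wvStream *in)
{
    U32 lo = read_16ubit(in);
    U32 hi = read_16ubit(in);
    return lo | (hi << 16);
}

/* Assumed d-variant behaviour: read from the stream when one is
   given, otherwise consume one byte from *list and advance it. */
U8 dread_8ubit(wvStream *in, U8 **list)
{
    if (in != NULL)
        return read_8ubit(in);
    return *(*list)++;
}
```

Because callers only ever see wvStream*, the bodies of these functions are the only place that has to change when the back-end is swapped.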

Part 1b.
Add support for libole2 (read and write)

libole2 is an OLE2 library for reading and writing OLE2 structured storage objects (i.e. Office files). It can currently be found in gnumeric. They may have split it out into a standalone library, but if not, just extract the code from gnumeric's CVS repository.

It should go into the wv tree, under wv/ole2 perhaps. It might make use of glib, so we'd either want to grab that, too, or remove the glib dependencies.

Then, write d/read_Xubit functions for using libole2 (instead of the current fread-based implementations).

Then, write write_Xubit functions for using libole2.

Again, these functions will go in wv/support.c
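As a rough illustration of the write side, the sketch below serializes values low byte first, mirroring the readers. It writes through a FILE-backed wvStream purely as a stand-in: the actual libole2 calls are deliberately not shown, and the "return the number of bytes written" convention is an assumption made for this example.

```c
#include <stdio.h>
#include <stdint.h>

/* Stand-ins for wv's integer typedefs (assumed widths). */
typedef uint8_t  U8;
typedef uint16_t U16;
typedef uint32_t U32;

/* FILE is only a placeholder back-end here; Part 1b would put
   the libole2 stream behind the same wvStream name. */
typedef FILE wvStream;

/* Each writer returns the number of bytes actually written
   (a convention chosen for this sketch). */
int write_8ubit(wvStream *out, U8 v)
{
    return (putc(v, out) == EOF) ? 0 : 1;
}

/* Little-endian: low byte first, matching the read functions. */
int write_16ubit(wvStream *out, U16 v)
{
    int n = write_8ubit(out, (U8)(v & 0xFF));
    n += write_8ubit(out, (U8)(v >> 8));
    return n;
}

int write_32ubit(wvStream *out, U32 v)
{
    int n = write_16ubit(out, (U16)(v & 0xFFFF));
    n += write_16ubit(out, (U16)(v >> 16));
    return n;
}
```

Building the wider writers on top of write_8ubit means only one function has to know how the back-end emits a byte.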

We'll want to #ifdef this new code and the old read functions with a sensible #define. Perhaps something like:
#define OLE2MODULE LIBOLE2
#define OLE2MODULE OLEDECOD // the current OLE2 code

Then, reimplement the functions in wv/laolareplace.c for use with libole2. Specifically, we care about:

wvOLEDecode, wvOLEFree, and wvFindEntry

We'll probably want to put these in their own file. Perhaps wv/libole2.c?

wvOpenPreOLE (in wv/wvparse.c) only needs to cast the input FILE* to wvStream* (which should have been accomplished in part 1a). It just fakes the OLE streams for early Word formats.

wvFindEntry will probably need to be changed to return OLE2 entries in a non-OLE2 library dependent way. It will only be relevant for allowing the converter to extract arbitrary embedded OLE2 objects (such as an Excel graph), which is definitely not an immediately pressing need. When someone gets to this function, we'll discuss it further.

The compilation of either:

1) wv/laolareplace.c and wv/oledecod/*
2) wv/libole2.c (or some similarly named file) and wv/ole2/*

should be conditional on the same #ifdef/#define stuff mentioned above. Simply, link in the right set of code for the specific OLE2 implementation.
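One way the compile-time switch could be arranged is sketched below. The numeric tag values, the default, and the wv_ole_backend_name function are all hypothetical; the point is only that giving LIBOLE2 and OLEDECOD distinct numeric values lets the preprocessor compare OLE2MODULE against them.

```c
/* Numeric tags so that #if can compare them (values are arbitrary). */
#define LIBOLE2  1
#define OLEDECOD 2

/* Default to the existing decoder unless the build overrides it,
   e.g. with -DOLE2MODULE=LIBOLE2 on the compiler command line. */
#ifndef OLE2MODULE
#define OLE2MODULE OLEDECOD
#endif

#if OLE2MODULE == LIBOLE2
/* The libole2-backed code (wv/libole2.c, wv/ole2/) would go here. */
const char *wv_ole_backend_name(void) { return "libole2"; }
#elif OLE2MODULE == OLEDECOD
/* The current code (wv/laolareplace.c, wv/oledecod/) would go here. */
const char *wv_ole_backend_name(void) { return "oledecod"; }
#else
#error "OLE2MODULE must be LIBOLE2 or OLEDECOD"
#endif
```

The #error branch catches a mistyped value at build time instead of silently linking in nothing.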

At this point, the infrastructure is in place for writing to streams in an OLE2 structured storage object. This is the bottom level of structure in a Word file, and now we need code to write pieces of the Word data into these streams in the right format.

Part 2.
Write wvPut* functions.

While browsing through the wv source, you've probably noticed many things like wvGetFIB, wvGetBTE, etc. These functions (and possibly some associated helper functions) read from the OLE stream and store the information in memory (with various structs, arrays, and lists).

Now, we need to go the other way. In the appropriate file, create the wvPut* function (and associated helper functions, if necessary) to write back to an OLE stream from the passed struct.

wvPutFIB would probably be a good place to start (wv/fib.c). It's a straightforward record, and the implementation should be fairly obvious based upon the wvGetFIB function.

The function definition should probably be something like:
U16 wvPutFIB(FIB*, wvStream*);

We'll assume that the stream is in the right place to write (i.e. the caller has already seeked the stream), and errors should be returned via the U16.

This implementation will, of course, be making use of the write_Xubit functions created in part 1b. If part 1b has not been done yet, people could start writing these wvPut* functions while just pretending that the write_Xubit functions actually existed.
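For a feel of the shape such a function might take, here is a sketch using a hypothetical DemoFIB that holds only four leading 16-bit fields. It is not wv's real FIB struct, the field list should be checked against wvGetFIB and the format documentation, and the error codes (0 for success, nonzero otherwise) are an assumed convention. A local write_16ubit over FILE stands in for the Part 1b function.

```c
#include <stdio.h>
#include <stdint.h>

typedef uint8_t  U8;
typedef uint16_t U16;
typedef FILE wvStream;   /* placeholder back-end, as in Part 1a */

/* Hypothetical, heavily truncated stand-in for wv's FIB. */
typedef struct {
    U16 wIdent;
    U16 nFib;
    U16 nProduct;
    U16 lid;
} DemoFIB;

/* Stand-in for the Part 1b writer; returns bytes written. */
static int write_16ubit(wvStream *out, U16 v)
{
    if (putc(v & 0xFF, out) == EOF) return 0;
    if (putc((v >> 8) & 0xFF, out) == EOF) return 1;
    return 2;
}

/* Writes the fields in the same order a matching Get function
   would read them. Assumes the caller has already positioned
   the stream. Returns 0 on success, nonzero on failure. */
U16 wvPutDemoFIB(const DemoFIB *fib, wvStream *out)
{
    int n = 0;

    if (fib == NULL || out == NULL)
        return 1;                 /* bad arguments */

    n += write_16ubit(out, fib->wIdent);
    n += write_16ubit(out, fib->nFib);
    n += write_16ubit(out, fib->nProduct);
    n += write_16ubit(out, fib->lid);

    return (n == 8) ? 0 : 2;      /* short write */
}
```

A quick sanity check for any wvPut*/wvGet* pair is a round trip: put a struct, rewind, get it back, and compare field by field.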

Now, there are lots of these, so many people could work on this part, each taking a few types to implement.

To figure out exactly how you're supposed to write the data out, you'll use a combination of techniques. First, consult the Word file format documentation.

http://busboy.sped.ukans.edu/~justin/word/

Second, look over the corresponding wvGet function. If the documentation and implementation differ, follow the implementation.

By the time we're done with these functions, we will be capable of writing a complete Word document. HOWEVER, we won't have any of the logic to populate all of these Word structs or sequence all of the wvPut* calls so that the data is in the appropriate order and location.

If people are interested in working on this further, I'll write up POWs for the "logic" steps, too.

hints
-----
Include debugging trace messages liberally.

wvTrace(("status messages, helpful in tracking things down"));
wvWarning("something is strange and might indicate a problem");
wvError(("critical problem, something's definitely wrong"));

These work just like Abi's UT_DEBUGMSG(("%s", "version")); except for wvWarning, which only has one set of parentheses.

wvTrace messages only show up in DEBUG builds. The others always show up.
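If the double parentheses look odd, the sketch below shows the general idiom behind them. It is not wv's actual implementation: DEMO_TRACE, DEMO_ERROR, demo_log and demo_last_msg are hypothetical names invented for this example. The inner parentheses are passed to the macro as a single argument and then become the argument list of a printf-style function.

```c
#include <stdio.h>
#include <stdarg.h>

/* Last formatted message, kept so the effect is easy to inspect. */
static char demo_last_msg[256];

/* printf-style helper: formats the message and echoes it to stderr. */
static void demo_log(const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    vsnprintf(demo_last_msg, sizeof demo_last_msg, fmt, ap);
    va_end(ap);
    fprintf(stderr, "%s\n", demo_last_msg);
}

/* DEMO_TRACE(("x=%d", x)) expands to demo_log ("x=%d", x).
   Outside DEBUG builds the whole call compiles away. */
#ifdef DEBUG
#define DEMO_TRACE(args) demo_log args
#else
#define DEMO_TRACE(args) ((void)0)
#endif

/* Errors are always reported, regardless of build type. */
#define DEMO_ERROR(args) demo_log args
```

This trick predates variadic macros, which is why the caller has to supply the extra pair of parentheses.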

Also, the Word 97 format can be found at: http://busboy.sped.ukans.edu/~justin/word/ (scary, isn't it?)

extra credit
------------
If you've gotten this far, and the Word format hasn't driven you insane, here are some ideas for a next step. Just as a warning, these will require a better understanding of the Word format.

a) Write a CHP/PAP/SEP compressor to generate CHPX/PAPX/SEPX SPRMs based on the property's base style.

b) Encode these SPRMs into the appropriate storage structure for the associated exception run (such as FKP BTEs)

c) Write an escher wrapping function to encode bitmaps for storage in the data stream

d) If you want more ideas or more explanation, email me (justin@ukans.edu)
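For item (a), the core idea is a diff against the base style: emit a modifier only for each property that differs. The sketch below shows that idea on a hypothetical three-field DemoCHP. The struct, the DemoSprm pair, and especially the opcode values are placeholders for illustration; real SPRM opcodes and operand encodings must come from the format documentation.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint8_t  U8;
typedef uint16_t U16;

/* Hypothetical, tiny subset of a character property set. */
typedef struct {
    U8  fBold;
    U8  fItalic;
    U16 hps;       /* font size in half-points */
} DemoCHP;

/* Placeholder opcodes, NOT the real Word values. */
enum { DEMO_SPRM_BOLD = 1, DEMO_SPRM_ITALIC = 2, DEMO_SPRM_HPS = 3 };

/* One emitted modifier: which property, and its new value. */
typedef struct {
    U16 opcode;
    U16 operand;
} DemoSprm;

/* Compare chp against its base style and emit one entry per
   differing property. Returns the number of entries written,
   never more than max. */
size_t demo_compress_chp(const DemoCHP *chp, const DemoCHP *base,
                         DemoSprm *out, size_t max)
{
    size_t n = 0;

    if (chp->fBold != base->fBold && n < max)
        out[n++] = (DemoSprm){ DEMO_SPRM_BOLD, chp->fBold };
    if (chp->fItalic != base->fItalic && n < max)
        out[n++] = (DemoSprm){ DEMO_SPRM_ITALIC, chp->fItalic };
    if (chp->hps != base->hps && n < max)
        out[n++] = (DemoSprm){ DEMO_SPRM_HPS, chp->hps };

    return n;
}
```

A run that matches its base style exactly produces zero entries, which is the common case and the reason the exception form saves space.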

----

PS: For more background on the whole POW / ZAP / SHAZAM concept, see the following introduction: http://www.abisource.com/mailinglists/abiword-dev/99/September/0097.html

Justin Bradford justin@ukans.edu

This archive was generated by hypermail 2b25 : Wed Mar 15 2000 - 17:28:28 CST