Re: Current CVS version

Paul Rohr (paul@abisource.com)
Fri, 12 Feb 1999 11:36:21 -0800


At 01:20 PM 2/12/99 -0600, Shaw Terwilliger wrote:
>Aah, thanks for refreshing my memory. It has been a while
>since I've checked up on MSWordView or the OLE library
>underneath. As you said, decoding OLE streams is just
>the precursor to reading a Word-formatted file; the second
>is a much more laborious task. We have the very beginnings
>of parsing out the file in the tree, but I'm not expecting
>that we continue that code base if we have a much cleaner
>way to (1) parse the OLE2 file, and (2) serialize the
>Word formatting and content into our internal document
>structure.

Justin,

Just to underscore what Shaw said -- our current .doc importer is a
placeholder that simply locates a text stream, rips it out and divides it up
into blocks. It does allow a useful peek into a subset of existing
documents, but dies on many others, including fast-saved ones.
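
To make "divides it up into blocks" concrete, here is a rough sketch of that
kind of splitting. It is not the code in the tree: the function name is
hypothetical, and it assumes the text stream has already been extracted into
a string in which a carriage return (0x0D) ends each paragraph.

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not the actual importer code.
// Assumption: `text` is the already-extracted text stream and each
// paragraph ends with a carriage return (0x0D).
std::vector<std::string> splitIntoBlocks(const std::string& text) {
    std::vector<std::string> blocks;
    std::string current;
    for (char c : text) {
        if (c == '\r') {
            // A paragraph mark closes the current block, even an empty one.
            blocks.push_back(current);
            current.clear();
        } else {
            current += c;
        }
    }
    // Keep trailing text that has no closing paragraph mark.
    if (!current.empty()) {
        blocks.push_back(current);
    }
    return blocks;
}
```

Anything this simple ignores formatting entirely, which is why it only gives
a peek at the text.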

Sounds like you're all set on sources for the OLE/Word side of the picture.
As you've probably noticed from a quick scan of the abi/src/wp/impexp/xp
sources, the in-memory format you'll be converting to/from is pretty
straightforward.

The basic idea is that most formatting information is expressed in strings
as CSS-like properties hanging off a section, block, object, or span. It's
a fairly general mechanism, which has made it easy for us to minimize
import/export headaches as we scale up our formatter implementation. We've
known all along that the Word file formats contain more info than we
currently support, so as you run across properties we're missing, we'll add 'em.
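
As a rough illustration of the string-property idea, here is a minimal
sketch that splits a CSS-like property string into name/value pairs. The
"name:value; name:value" syntax, the property names, and the parseProps
helper are all assumptions made for the example, not the actual code in the
tree.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper: strip leading and trailing whitespace.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const std::size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Hypothetical parser, not AbiWord's API.
// Assumption: properties look like "name:value; name:value".
std::map<std::string, std::string> parseProps(const std::string& props) {
    std::map<std::string, std::string> result;
    std::size_t start = 0;
    while (start <= props.size()) {
        std::size_t end = props.find(';', start);
        if (end == std::string::npos) end = props.size();
        const std::string decl = props.substr(start, end - start);
        const std::size_t colon = decl.find(':');
        // Declarations without a colon or without a name are skipped.
        if (colon != std::string::npos) {
            const std::string name = trim(decl.substr(0, colon));
            if (!name.empty()) result[name] = trim(decl.substr(colon + 1));
        }
        start = end + 1;
    }
    return result;
}
```

With a lookup like this, an importer could translate each Word formatting
attribute it understands into one name/value pair and skip the rest, which
is what lets support grow one property at a time.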

Also, note that our in-memory format is *astonishingly* similar to our
native file format. :-) There is a draft document describing the intended
file format in abi/docs, but it's out of date again, so you're probably best
off reverse engineering the files in abi/src/wp/samples instead. Those do stay
up-to-date.

Hopefully that's enough to get you rolling. If not, let us know.

Paul


