Justin,
Just to underscore what Shaw said -- our current .doc importer is a
placeholder that simply locates a text stream, rips it out and divides it up
into blocks. It does allow a useful peek into a subset of existing
documents, but dies on many others, including fast-saved ones.
Sounds like you're all set on sources for the OLE/Word side of the picture.
As you've probably noticed from a quick scan of the abi/src/wp/impexp/xp
sources, the in-memory format you'll be converting to/from is pretty
straightforward.
The basic idea is that most formatting information is expressed in strings
as CSS-like properties hanging off a section, block, object, or span. It's
a fairly general mechanism, which has made it easy for us to minimize
import/export headaches as we scale up our formatter implementation. We've
known all along that the Word file formats contain more info than we
currently support, so as you run across properties we lack, we'll add 'em.
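
To make the property-string idea concrete, here is a rough sketch in Python. It is not taken from the AbiWord sources: the "name:value; name:value" syntax, the helper names parse_props and format_props, and the sample property names are all assumptions for illustration only. Check the real syntax against the files in abi/src/wp/samples.

```python
def parse_props(props):
    """Split a CSS-like 'name:value; name:value' string into a dict.

    The separator characters are an assumption for this sketch,
    not a confirmed description of the real format.
    """
    result = {}
    for item in props.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate trailing or doubled semicolons
        name, sep, value = item.partition(":")
        if not sep:
            continue  # skip malformed entries that have no colon
        result[name.strip()] = value.strip()
    return result


def format_props(props):
    """Serialize a dict back into a 'name:value; name:value' string."""
    return "; ".join(f"{name}:{value}" for name, value in props.items())


# Hypothetical properties hanging off a span.
span = parse_props("font-weight:bold; font-size:12pt;")
span["font-style"] = "italic"  # an importer adds a property it just found
print(format_props(span))
```

The point of the sketch is that an importer can attach a property it discovers without any change to the container: supporting a new one is just a matter of agreeing on a new name.
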
Also, note that our in-memory format is *astonishingly* similar to our
native file format. :-) There is a draft document describing the intended
file format in abi/docs, but it's out of date again, so you're probably best
off reverse engineering the files in abi/src/wp/samples instead. Those do stay
up-to-date.
Hopefully that's enough to get you rolling. If not, let us know.
Paul