Re: msword doc bug


Subject: Re: msword doc bug
From: Justin Bradford (justin@ukans.edu)
Date: Wed Jan 19 2000 - 13:04:10 CST


> Supposedly the Word 97 format spec is available in the MSDN library
> (available at msdn.microsoft.com) but you need a working version of IE,
> with Microsoft Java, to find it. As I do not have any of these, if someone
> who does would like to find the document and post the URL, or just find
> out what it specifies the first bytes should be, that would make this job
> much easier.

http://busboy.sped.ukans.edu/~justin/word/wword8.html

The only problem is that the guts of a Word file are inside an OLE2
structured storage object (since version 6, at least). So the beginning of
any MS Office document looks the same. To figure out if it's really a
.doc, you'd need to decode some of the OLE stuff, extract the Summary
Info, parse it, and then return true/false.
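
To illustrate why the first bytes alone are not enough, here is a minimal
Python sketch (not Caolan's code, and not anything from wv) that checks only
for the 8-byte OLE2 signature. The name looks_like_ole2 is made up for this
example. A match means "some OLE2 container", which could be Word, Excel, or
any other Office file, so the Summary Info step described above is still
needed to say it's really a .doc.

```python
# The 8-byte signature found at the start of every OLE2
# structured storage (compound document) file.
OLE2_MAGIC = bytes([0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1])


def looks_like_ole2(data: bytes) -> bool:
    """Return True if data begins with the OLE2 signature.

    This does NOT prove the file is a Word document; it only shows
    that the file is an OLE2 container of some kind.
    """
    return data[:8] == OLE2_MAGIC
```

In practice you would read the first 8 bytes of the file in binary mode and
pass them in; a True result is the cue to go on and decode the OLE directory.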

Luckily, Caolan's already done this. I know he submitted code to do the
above for all Office files to the maintainers of 'file' (the unix utility
that tells you what's in a file). I'm not sure if the code is in file yet,
but I imagine Caolan would be willing to submit the code here (or possibly
add it as a function in wv). For 'magic number'-type tests, file already
does most of them, so we could borrow code/ideas from it.

For instance, the old version I have recognizes RTF already. It's confused
by abiword files (returning either English text, ASCII text, or exported
SGML document text), depending on the charsets used and whether the doc
starts with comments.
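
A 'magic number'-type test of this kind could look roughly like the Python
sketch below. It is only an illustration, not how file does it. The RTF
check relies on RTF files starting with the bytes {\rtf. The abiword check
is an assumption on my part: it guesses that the root element <abiword shows
up somewhere in the first 1024 bytes, which sidesteps the leading-comments
problem but is not taken from any spec. The name sniff and the 1024-byte
window are made up for this example.

```python
from typing import Optional

RTF_MAGIC = b"{\\rtf"      # RTF files begin with the bytes {\rtf
ABIWORD_TAG = b"<abiword"  # assumed root element of an abiword file
SNIFF_WINDOW = 1024        # arbitrary window size chosen for this sketch


def sniff(data: bytes) -> Optional[str]:
    """Guess a format from the leading bytes, or return None if unsure."""
    if data.startswith(RTF_MAGIC):
        return "rtf"
    # Search a window instead of a fixed offset, since an XML
    # declaration or comments may come before the root element.
    if ABIWORD_TAG in data[:SNIFF_WINDOW]:
        return "abiword"
    return None
```

Searching a window rather than testing a fixed offset is what lets this cope
with files that open with an XML declaration or comments.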

You can find file at:
ftp://ftp.astron.com/pub/file/file-3.28.tar.gz

Justin


