Re: detecting file type by magic number


Subject: Re: detecting file type by magic number
From: Paul Rohr (paul@abisource.com)
Date: Fri Jan 21 2000 - 18:26:55 CST


At 11:44 AM 1/21/00 -0500, Leonard Rosenthol wrote:
> According to the XML spec, an XML document MUST begin with
>(at least) the line above (<?xml version="1.0"?>).

To be pedantic, here's the relevant sentence from the spec:

  http://www.w3.org/TR/REC-xml#sec-prolog-dtd
  -------------------------------------------
  "XML documents may, and should, begin with an XML declaration which
   specifies the version of XML being used. [...] The version number "1.0"
   should be used to indicate conformance to this version of this
   specification; it is an error for a document to use the value "1.0" if it
   does not conform to this version of this specification."

Note that this says "may, and should", not "must". The spec defines "may"
and "must" as follows:

  http://www.w3.org/TR/REC-xml#sec-terminology
  --------------------------------------------
  may
    Conforming documents and XML processors are permitted to but need
    not behave as described.

  must
    Conforming documents and XML processors are required to behave as
    described; otherwise they are in error.

Don't get me wrong. I have zero objection to adding a <?xml ...?> line at
the head of our document format. It's a very, very easy patch, and we
*should* do so...

... but until someone explicitly confirms that the documents we produce *do*
in fact conform to this spec, it's an error to claim that we do.

>That line may
>contain other attributes (eg. character set encoding), and may be
>followed by the line that you suggest for specifying where to find
>the relevant DTD (or other schema types).
>
> ONLY after all that "setup" would you find the base element
>tag - <abiword> in this case.

Yep. Exactly. A more robust sniffer algorithm would look at the first
three lines of the file, ignoring any lines which start with the following
character sequences:

  <?xml
  <!doctype

If the next line starts with "<abiword", that's us. Otherwise, this could
be another xml-format document, and the sniffer should refuse to guess.
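
For the sake of discussion, here's a minimal sketch of that sniffer in
Python. It is not the actual AbiWord importer code. The function name
sniff_abiword and the constants are made up for illustration, and matching
the prefixes case-insensitively (so that "<!DOCTYPE" also counts) is my
assumption, as is treating a blank line like any other unrecognized line.

```python
import io

# Lines starting with these are prolog and get skipped (assumed case-insensitive).
SKIP_PREFIXES = ("<?xml", "<!doctype")
# The root element tag we are looking for.
ROOT_PREFIX = "<abiword"
# Only the first three lines of the file are examined.
MAX_LINES = 3


def sniff_abiword(stream):
    """Return True if the first few lines look like an AbiWord document.

    `stream` is any text file object. Returns False when the sniffer
    should refuse to guess.
    """
    for _ in range(MAX_LINES):
        line = stream.readline()
        if not line:
            return False  # ran out of input before finding a root element
        text = line.strip().lower()
        if text.startswith(SKIP_PREFIXES):
            continue  # XML declaration or doctype line: ignore it
        # First non-prolog line: either it is our root element or we give up.
        return text.startswith(ROOT_PREFIX)
    return False  # three prolog lines and still no root element


if __name__ == "__main__":
    sample = io.StringIO('<?xml version="1.0"?>\n<abiword version="0.7">\n')
    print(sniff_abiword(sample))
```

Like the description above, this only ever looks at line prefixes, so a root
element that shares a line with the XML declaration would fail the sniff.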

Yes, we can import abiword documents which fail this sniff test.
Yes, someone could author XML documents which falsely pass this sniff test.
Yes, a "real" XML parser could do a better job than this dumb sniffer.

My claim is that this is probably all we need.

Paul


