Subject: Re: detecting file type by magic number
From: Paul Rohr (paul@abisource.com)
Date: Fri Jan 21 2000 - 18:26:55 CST
At 11:44 AM 1/21/00 -0500, Leonard Rosenthol wrote:
> According to the XML spec, an XML document MUST begin with
>(at least) the line above (<?xml version="1.0"?>).
To be pedantic, here's the relevant sentence from the spec:
http://www.w3.org/TR/REC-xml#sec-prolog-dtd
-------------------------------------------
"XML documents may, and should, begin with an XML declaration which
specifies the version of XML being used. [...] The version number "1.0"
should be used to indicate conformance to this version of this
specification; it is an error for a document to use the value "1.0" if it
does not conform to this version of this specification."
Where "may" and "must" are defined as follows:
http://www.w3.org/TR/REC-xml#sec-terminology
--------------------------------------------
may
Conforming documents and XML processors are permitted to but need
not behave as described.
must
Conforming documents and XML processors are required to behave as
described; otherwise they are in error.
Don't get me wrong. I have zero objection to adding a <?xml ...?> line at
the head of our document format. It's a very, very easy patch, and we
*should* do so...
... but until someone explicitly confirms that the documents we produce *do*
in fact conform to this spec, it's an error to claim that we do.
>That line may
>contain other attributes (eg. character set encoding), and may be
>followed by the line that you suggest for specifying where to find
>the relevant DTD (or other schema types).
>
> ONLY after all that "setup" would you find the base element
>tag - <abiword> in this case.
Yep.  Exactly.  A more robust sniffer algorithm would look at the first 
three lines of the file, ignoring any lines which start with the following 
character sequences:
<?xml
<!doctype
If the next line starts with "<abiword", that's us. Otherwise, this could
be another xml-format document, and the sniffer should refuse to guess.
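For illustration only, here is a minimal sketch of the sniffer described above, written in Python. The function name sniff_abiword is hypothetical, and two details are assumptions not stated in the message: the prefix comparison is case-insensitive (so "<!DOCTYPE" is skipped as well as "<!doctype"), and leading whitespace on each line is ignored.

```python
def sniff_abiword(head: str) -> bool:
    """Guess whether text looks like an AbiWord document (hypothetical helper)."""
    # Only the first three lines are examined, as in the description above.
    for line in head.splitlines()[:3]:
        # Assumption: ignore leading whitespace and letter case.
        lowered = line.lstrip().lower()
        # Skip the XML declaration and the DOCTYPE line.
        if lowered.startswith("<?xml") or lowered.startswith("<!doctype"):
            continue
        # The first line that is not skipped must open the root element.
        return lowered.startswith("<abiword")
    # Every examined line was skipped, or the input was empty: refuse to guess.
    return False
```

A file whose first non-skipped line opens some other element is rejected, which matches the "refuse to guess" behavior. Like the algorithm it sketches, this will miss documents whose declaration spans several lines.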
Yes, we can import abiword documents which fail this sniff test.
Yes, someone could author XML documents which falsely pass this sniff test.
Yes, a "real" XML parser could do a better job than this dumb sniffer.
My claim is that this is probably all we need.
Paul