Re: Importing HTML


Subject: Re: Importing HTML
From: Alan Horkan (horkana@tcd.ie)
Date: Tue Apr 24 2001 - 11:15:17 CDT


Dom Lachowicz wrote:
>
> >Would it be hard to allow non-well-formed HTML to be imported?
> >If not, how hard would it be to report the syntax error back
> >so that the user can fix it and try again?
>
> Yes, this would be very hard. Currently, we base our HTML importer on our
> XML importer class, which means that the HTML must be well-formed. Writing a
> parser for HTML isn't high on my priority list, though, esp. with all of the
> nastiness that the browsers allow you to do. XHTML deprecates a lot of the
> tags (yay!) and is *very* strict about what can appear where
> (well-formedness).
>
> It shouldn't be too hard to propagate a more descriptive error message
> upstream, however.
>

Any chance of a partial/lossy import that ignores all unknown tags and
drops all unmatched tags?
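
To make the idea concrete, here is a rough sketch in Python (not AbiWord
code). It leans on the standard library's html.parser, which tolerates
malformed markup. The KNOWN_TAGS whitelist and the event-list output are
assumptions made up for illustration; a real importer would feed its own
document model instead.

```python
from html.parser import HTMLParser

# Hypothetical whitelist: stands in for whatever tags the importer supports.
KNOWN_TAGS = {"p", "b", "i", "br", "h1", "h2"}
VOID_TAGS = {"br"}  # tags that never take a closing tag


class LossyImporter(HTMLParser):
    """Turns possibly malformed HTML into a balanced list of events."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_tags = []  # stack of known tags that are currently open
        self.events = []     # ("start", tag), ("end", tag) or ("text", data)

    def handle_starttag(self, tag, attrs):
        if tag not in KNOWN_TAGS:
            return  # ignore the unknown tag itself, but keep its text
        self.events.append(("start", tag))
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # unknown or unmatched closing tag: drop it
        # Close anything left open inside the tag that is being closed.
        while self.open_tags:
            top = self.open_tags.pop()
            self.events.append(("end", top))
            if top == tag:
                break

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only runs between tags
            self.events.append(("text", data))


def lossy_import(html):
    parser = LossyImporter()
    parser.feed(html)
    parser.close()
    # Close whatever is still open at the end of the input.
    while parser.open_tags:
        parser.events.append(("end", parser.open_tags.pop()))
    return parser.events
```

For example, lossy_import("<p>Hello <blink>big</blink> <b>world</i></p>")
keeps the text inside the unknown blink tag, drops the stray </i>, and
closes the b element before the paragraph ends.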

More simply, what I'm trying to suggest is: "Error: this document
contains invalid HTML. Would you like to import it as plain text with
line breaks?"

Please :)
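
A minimal sketch of that fallback, again in Python rather than AbiWord's
own code: try a strict XML parse first, and if that fails, report where
it failed and offer a plain-text import. The names import_html,
as_plain_text and the confirm callback are hypothetical, and the
BREAKING_TAGS set is only a guess at which tags should start a new line.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Assumed set of tags that should start a new line in the plain-text import.
BREAKING_TAGS = {"br", "p", "div", "li", "tr", "h1", "h2", "h3"}


class _TextOnly(HTMLParser):
    """Keeps the text, discards the markup, and marks line breaks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BREAKING_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def as_plain_text(source):
    parser = _TextOnly()
    parser.feed(source)
    parser.close()
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = [" ".join(line.split()) for line in "".join(parser.parts).splitlines()]
    return "\n".join(line for line in lines if line)


def import_html(source, confirm):
    """Strict import first; on failure ask the user via confirm(message)."""
    try:
        return "xml", ET.fromstring(source)
    except ET.ParseError as err:
        line, column = err.position  # where the strict parser gave up
        message = (
            "Error: this document contains invalid HTML "
            f"(line {line}, column {column}). "
            "Would you like to import it as plain text with line breaks?"
        )
        if confirm(message):
            return "text", as_plain_text(source)
        return "cancelled", None
```

Calling import_html("<p>one<br>two</p>", confirm) with a confirm callback
that answers yes falls back to the two text lines "one" and "two", since
the unclosed <br> is not well-formed XML.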

Or, as a temporary measure, we could recommend an HTML validator like:
http://validator.w3.org/

-- 
   ~     
  |v|    
 // \\   
/(   )\  
 ^`~'^


