Re: What route for the XHTML importer?

From: Dom Lachowicz <domlachowicz_at_yahoo.com>
Date: Tue May 18 2004 - 02:25:13 CEST

We already have both a TidyLib-based and a libxml2-based
XHTML importer. I think it makes the most sense to use the
libxml2 one, especially on our Unix builds (since
we'll already be using libxml2 there).

Just my $0.02,
Dom

--- Ryan Pavlik <abiryan@ryand.net> wrote:
>
> Hubert Figuiere wrote:
>
> >Hi,
> >
> >On Fri, 2004-05-14 at 14:31 +1000, Martin Sevior wrote:
> >
> >> As it currently stands the XHTML importer is very fragile and
> >>very strict. If an HTML file doesn't exactly fit the XHTML spec we
> >>barf on it.
> >>
> >>Now as you all know there are a lot of broken HTML files that
> >>render just fine in IE, Mozilla and many other browsers.
> >>
> >>So my question is:
> >>
> >>Should we attempt to import broken HTML files or just barf on
> >>them and say "Illegal document"?
> >>
> >>I would MUCH rather attempt to import them as well as possible.
> >>
> >
> >Here is my 0.02 CAD opinion:
> >
> >We should not limit ourselves to XHTML. Why? Simply because Joe
> >Average wants to import "HTML documents" made by crappy software,
> >and there is simply too much of it out there. What about HTML? We
> >should be as permissive as we can.
> >
> >Things we can assume to fail:
> >-frames
> >-script
> >-HTML markup generated by scripts
> >
> >Things we must eat:
> >-mixed-case tags
> >-unclosed tags
> >-inconsistent tags
> >-tags in the wrong context
> >-some extensions
> >
> >Parsing HTML is a lot of work. I'd much prefer us "stealing" some
> >code from another Free software project, something that would come
> >from Lynx, links, w3m, khtml (or Apple's incarnation), gtkhtml, etc.
> >Even Mozilla, but I'm not sure it would not bring in too much.
> >
> >So in short: don't barf too soon on a document.
> >
> >For the test bed, just use wget <sigh>
> >
> >Hub
> >
> Can we somehow use tidylib (from HTML Tidy) to pretty up documents
> into XHTML? I've had a fair amount of success using it stand-alone
> to clean up some pretty lousy data, and it's good at cleaning up
> exported HTML-like stuff from meaningless little applications like
> Word. Even if only as some sort of pre-processing plugin, it might
> prove a good solution.
> http://tidy.sourceforge.net/
>
> Ryan
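
To make the "things we must eat" list above concrete, here is a
minimal sketch of permissive tag handling. It is not AbiWord code and
uses neither TidyLib nor libxml2: it assumes Python's standard
html.parser module purely for illustration, and the names
LenientCollector and normalize are hypothetical. It lowercases
mixed-case tags, closes unclosed tags, and ignores stray closing tags
instead of rejecting the document.

```python
from html.parser import HTMLParser


class LenientCollector(HTMLParser):
    """Hypothetical helper: turns sloppy HTML into balanced events."""

    # Elements that never take a closing tag.
    VOID = {"br", "hr", "img", "meta", "link", "input"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_tags = []  # stack of currently open elements
        self.events = []     # normalized (kind, value) pairs

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag names, so <P> and <p> match.
        self.events.append(("start", tag))
        if tag not in self.VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: skip it rather than fail
        # Close every unclosed element sitting above the matching one.
        while self.open_tags:
            top = self.open_tags.pop()
            self.events.append(("end", top))
            if top == tag:
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.events.append(("text", text))

    def close(self):
        super().close()  # flush any buffered text first
        # Close whatever the document left open at end of input.
        while self.open_tags:
            self.events.append(("end", self.open_tags.pop()))


def normalize(html):
    """Return a balanced list of events for possibly broken HTML."""
    collector = LenientCollector()
    collector.feed(html)
    collector.close()
    return collector.events
```

For example, normalize("<P>Hello <B>world</P></i>") yields a balanced
sequence in which <b> is closed before <p> and the stray </i> is
dropped. A real importer would also have to deal with tags in the
wrong context, which this sketch does not attempt.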

Received on Tue May 18 02:15:07 2004

This archive was generated by hypermail 2.1.8 : Tue May 18 2004 - 02:15:07 CEST