HTML importer plans...

michael@surfnetcity.com.au
Thu, 25 Mar 1999 13:00:56 +1100


OK, first I'd like to say a few things before I get technical.

I make no promise that I'll finish this, or even get enough code to work with.
Sorry, but time is limited for me, and the only reason I'm considering it is
because I've been wanting to do an HTML importer library for a while.

Ok, now on with the show...

Here are my plans for an HTML importer. They include some software design
issues that I'd like opinions on too.

Headers: (Everything within the <HEAD> tag)

Basically, I'll leave the headers alone. I might keep them in memory for an
HTML export, but other than that, they're not needed for AbiWord. The <BODY>
tag will automatically end the <HEAD> section, even if no </HEAD> is found.
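
To make the "<BODY> ends the <HEAD> section" rule concrete, here is a rough
sketch in C. The names find_ci and head_end_offset are made up for
illustration, and matching "<body" as a plain substring is a simplifying
assumption (it would also match a bogus tag such as <BODYFOO>).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive substring search; returns NULL when needle is absent. */
static const char *find_ci(const char *hay, const char *needle)
{
    size_t n = strlen(needle);
    for (; *hay != '\0'; hay++) {
        size_t i = 0;
        while (i < n && hay[i] != '\0' &&
               tolower((unsigned char)hay[i]) ==
               tolower((unsigned char)needle[i]))
            i++;
        if (i == n)
            return hay;
    }
    return NULL;
}

/* Offset at which the header section ends: the first "</HEAD" or "<BODY",
 * whichever comes first, or the end of the input if neither appears.
 * (Hypothetical helper, not part of any existing AbiWord code.) */
size_t head_end_offset(const char *html)
{
    assert(html != NULL);
    const char *close = find_ci(html, "</head");
    const char *body = find_ci(html, "<body");

    if (close != NULL && (body == NULL || close < body))
        return (size_t)(close - html);
    if (body != NULL)
        return (size_t)(body - html);
    return strlen(html);
}
```

Everything before the returned offset could be saved for a later export or
simply skipped; the importer proper would start reading at that offset.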

Body:

OK, this is everything from the start of the <BODY> tag until the </BODY>
tag or EOF.

My importer will probably be based on HTML 3.2. Anything non-3.2 will be
ignored, except under special circumstances (<P> comes to mind...).

I will handle tags by having a set of flags for recognised tags, and some
char * pointers to store the arguments to tags. Each pointer will either be
NULL if there were no arguments, or contain everything after the tag name.
Even if I don't handle the arguments at first, the architecture will
(partially) be there for it.
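
A minimal sketch of what the flags plus char * layout might look like. The tag
list, the TagState struct and open_tag are all hypothetical names chosen for
the example, and the input is assumed to be the text between '<' and '>'.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical tag list; a real importer would cover the HTML 3.2 set. */
enum { TAG_A, TAG_B, TAG_FONT, TAG_I, TAG_P, TAG_COUNT };

static const char *const tag_names[TAG_COUNT] = { "A", "B", "FONT", "I", "P" };

typedef struct {
    unsigned open_flags;    /* bit i is set while tag i is open */
    char *args[TAG_COUNT];  /* NULL if the tag had no arguments */
} TagState;

/* True if the first n characters of s spell name, ignoring case. */
static int name_matches(const char *s, size_t n, const char *name)
{
    if (strlen(name) != n)
        return 0;
    for (size_t i = 0; i < n; i++)
        if (toupper((unsigned char)s[i]) != name[i])
            return 0;
    return 1;
}

/* Record an opening tag, e.g. inside = "FONT SIZE=3". Returns the tag
 * index, or -1 if the tag is unrecognised or memory runs out; in both
 * failure cases the state is left unchanged. */
int open_tag(TagState *st, const char *inside)
{
    assert(st != NULL && inside != NULL);

    size_t n = 0;                       /* length of the tag name */
    while (inside[n] != '\0' && !isspace((unsigned char)inside[n]))
        n++;
    const char *rest = inside + n;      /* everything after the name */
    while (isspace((unsigned char)*rest))
        rest++;

    for (int i = 0; i < TAG_COUNT; i++) {
        if (!name_matches(inside, n, tag_names[i]))
            continue;
        char *copy = NULL;
        if (*rest != '\0') {
            copy = malloc(strlen(rest) + 1);
            if (copy == NULL)
                return -1;
            strcpy(copy, rest);
        }
        free(st->args[i]);              /* drop arguments from an earlier open */
        st->args[i] = copy;
        st->open_flags |= 1u << i;
        return i;
    }
    return -1;                          /* unrecognised: caller ignores it */
}
```

Starting from a zeroed TagState, open_tag(&st, "font SIZE=3") would set the
FONT bit and store "SIZE=3", while an unknown tag such as "BLINK" returns -1
and changes nothing.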

Any tag that reappears before it has been closed will take precedence over
the previous one. This will be semi-useful for the <A> tag, and downright
annoying for the <FONT> tag. The ones that fit in the second category will
have to be special cases, and extra code will have to be added for them later.
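
One possible shape for that extra code, purely as an assumption about how the
<FONT> special case could be handled: keep a small stack of sizes so that
closing an inner <FONT> restores the outer one. FontStack and its functions
are hypothetical names, and the fixed depth limit is arbitrary.

```c
#include <assert.h>
#include <stddef.h>

#define FONT_STACK_MAX 16   /* arbitrary nesting limit for this sketch */

typedef struct {
    int sizes[FONT_STACK_MAX];
    size_t depth;           /* number of <FONT> tags currently open */
} FontStack;

/* Call on <FONT SIZE=n>. Returns 0 on success, -1 if nested too deeply. */
int font_push(FontStack *fs, int size)
{
    assert(fs != NULL);
    if (fs->depth == FONT_STACK_MAX)
        return -1;
    fs->sizes[fs->depth++] = size;
    return 0;
}

/* Call on </FONT>. A stray close tag on an empty stack is ignored. */
void font_pop(FontStack *fs)
{
    assert(fs != NULL);
    if (fs->depth > 0)
        fs->depth--;
}

/* Size in effect: the innermost open <FONT>, or default_size if none. */
int font_current(const FontStack *fs, int default_size)
{
    assert(fs != NULL);
    return fs->depth > 0 ? fs->sizes[fs->depth - 1] : default_size;
}
```

With a plain "latest one wins" slot, the text after an inner </FONT> would
lose the outer size; with the stack it falls back to whatever was pushed
before.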

Unrecognised tags will be ignored, and stupid colour settings will go back to
Mosaicesque defaults. If people can't get it right manually, it will be
ignored by my importer. (Or honoured with a segfault)
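
The colour fallback could be sketched like this. The function name
parse_colour and the two default values are assumptions (grey background and
black text, roughly what Mosaic showed), and only the "#RRGGBB" form is
accepted here; the HTML 3.2 colour names are left out to keep it short.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Assumed "Mosaicesque" defaults: grey page, black text. */
#define DEFAULT_BG   0xC0C0C0UL
#define DEFAULT_TEXT 0x000000UL

/* Parse "#RRGGBB" into 0xRRGGBB. Anything else (NULL, wrong length,
 * non-hex digits, colour names) yields the fallback value instead. */
unsigned long parse_colour(const char *value, unsigned long fallback)
{
    assert(fallback <= 0xFFFFFFUL);     /* fallback must be a 24-bit colour */

    if (value == NULL || value[0] != '#' || strlen(value) != 7)
        return fallback;
    for (int i = 1; i < 7; i++)
        if (!isxdigit((unsigned char)value[i]))
            return fallback;
    return strtoul(value + 1, NULL, 16);
}
```

So a BGCOLOR of "#FF0000" would be honoured, while "reddish" or "#12345"
would quietly become DEFAULT_BG.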

Comments?

-- 
-- Michael Samuel <michael@surfnetcity.com.au>

The only person here stupid enough to even consider an HTML importer.
