abiword-dev Mailing List Archive: Re: Commit: PDF import plugin

From: <msevior_at_physics.unimelb.edu.au>
Date: Fri Mar 18 2005 - 02:45:29 CET

>
>> I'm just wondering if it would be valid to ask the Poppler user
>> community to help develop PDF->RTF, instead of just you (or other
>> AbiFolk) working on making the PDF->TXT support what we want or
>> developing a custom PDF->ABW.
>
> I don't think that the Poppler folks are interested in
> this use-case, other than "wouldn't it be cool if
> someone else did it". But I could be wrong if someone
> wants to ask them. I can see the KWord and OOo teams
> caring, though.
>
> As for mapping PDF to RTF, PDF doesn't have any
> semantic information to speak of that word processor
> formats care about. It's just a vector graphic that
> uses a brain-damaged format. Not that SVG is all that
> much saner...
>
> The PDF->TEXT converter tried to reconstruct at least
> the logical text order. I can certainly help produce a
> PDF->RTF converter by stealing liberally from the
> PDF->TEXT converter. The KWord folk would probably be
> interested in it. With a bit of sneakiness, it should
> be possible to largely preserve the following:
>
> *) Font names
> *) Font sizes
> *) Font styles (bold, italic)
> *) Images
> *) Certain annotations, preserved as RTF comments
> *) Colors
>
> By doing multiple passes + some heuristics, it may be
> possible to reconstruct some semantic layouts (e.g.
> columns or sections), but I don't imagine this
> happening anytime soon.
>

Hi Dom,
It certainly sounds like a long-term project. Regarding our RTF
import: it's basically at the level of our *.abw, modulo some
off-by-one paragraph errors, and Birch has been giving us great help
recently in getting a number of niggling details right. We certainly
support text boxes and positioned objects.

Let me know if you need help importing images via the PDF importer. We do
positioned images quite well now, so in principle we should be able to map
a PDF image at location (x,y) on a page to the same location in an AbiWord
document.
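
As a rough illustration of that mapping (a sketch only, not AbiWord's
importer code): PDF's default user space puts the origin at the bottom-left
corner of the page and measures in points (1/72 inch), whereas a positioned
object in a word processor is usually described by its offset from the
top-left corner of the page. The names PdfImageBox and to_top_left_inches
are hypothetical, and the sketch assumes an unrotated page with no extra
transformation applied to the image.

```python
from dataclasses import dataclass

POINTS_PER_INCH = 72.0  # PDF default user space unit is 1/72 inch


@dataclass
class PdfImageBox:
    """Hypothetical image placement in PDF default user space (points)."""
    x: float       # left edge, measured from the left of the page
    y: float       # bottom edge, measured from the BOTTOM of the page
    width: float
    height: float


def to_top_left_inches(box, page_height_pt):
    """Return (left, top, width, height) in inches from the page's top-left.

    Assumes an unrotated page and an axis-aligned image.
    """
    left = box.x / POINTS_PER_INCH
    # Flip the vertical axis: the image's top edge sits at y + height
    # above the page bottom, so its distance from the page top is
    # page_height - (y + height).
    top = (page_height_pt - box.y - box.height) / POINTS_PER_INCH
    return (left, top,
            box.width / POINTS_PER_INCH,
            box.height / POINTS_PER_INCH)


# Example: a 2in x 1in image on a US Letter page (612 x 792 points).
placement = to_top_left_inches(PdfImageBox(72, 648, 144, 72), 792)
```

The resulting offsets would still have to be written out in whatever
units and properties the importer's positioned-image support expects.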

The heuristics for columns etc. sound interesting. If we can work out
heuristically where paragraphs end, the fonts and font sizes, column
widths etc., we may be able to recover a reasonable fraction of the
document semantics.
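
To make the idea concrete, here is a minimal sketch of one possible
heuristic; it is not what the PDF->TEXT converter actually does. It
assumes we already have each text line with its left edge, its distance
from the top of the page, and its font size. The Line type, the function
names, and both thresholds (min_gap, gap_factor) are assumptions chosen
for illustration.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical extracted text line; coordinates in points."""
    x: float          # left edge
    y: float          # baseline distance from the top of the page
    font_size: float
    text: str


def split_columns(lines, min_gap=100.0):
    """Group lines into columns by clustering their left edges.

    A new column starts when the left edge jumps by at least min_gap.
    """
    columns = []
    for line in sorted(lines, key=lambda l: l.x):
        if columns and line.x - columns[-1][-1].x < min_gap:
            columns[-1].append(line)
        else:
            columns.append([line])
    # Within each column, restore top-to-bottom reading order.
    return [sorted(col, key=lambda l: l.y) for col in columns]


def split_paragraphs(column, gap_factor=1.5):
    """Start a new paragraph when the vertical step between two
    consecutive lines exceeds gap_factor times the font size."""
    paragraphs = []
    prev = None
    for line in column:
        if prev is not None and line.y - prev.y <= gap_factor * prev.font_size:
            paragraphs[-1].append(line.text)
        else:
            paragraphs.append([line.text])
        prev = line
    return [" ".join(p) for p in paragraphs]


lines = [
    Line(72, 100, 12, "First line"),
    Line(72, 114, 12, "continues here."),
    Line(72, 140, 12, "Second paragraph."),
    Line(320, 100, 12, "Right column."),
]
columns = split_columns(lines)
paragraphs = [split_paragraphs(col) for col in columns]
```

Real documents would need more than this (indentation, justified text,
headings, tables), which is part of why it looks like a long-term job.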

But it is a long-term project, as you say.

Great Work BTW! I've wanted this for a long time :-)

Cheers

Martin

> Best,
> Dom
>
Received on Fri Mar 18 02:46:11 2005
