Re: 5291 - A Task Force

From: Martin Sevior (msevior@seviorpc.ph.unimelb.edu.au)
Date: Thu Sep 25 2003 - 22:14:27 EDT

  • Next message: Martin Sevior: "commit: Basic XHTML table import."

    On Fri, 2003-09-26 at 02:13, Dom Lachowicz wrote:
    > > 1. Just #ifdef out the comparison in addIfUnique so
    > > we just add the
    > > strings. This provides no penalty for speed at all.
    > > The downside is that
    > > the attributes/properties grow linearly with
    > > document size and every
    > > time someone presses a key.
    >
    > I agree with Tomas. This is a bad idea.
    >
    > > 2. Be selective about addIfUnique. We don't do the
    > > ifUnique upon
    > > document loading but keep it during editing. This
    > > way the document
    > > doesn't grow any bigger than needed during editing
    > > but we keep the fast
    > > load.
    >
    > This sounds better than the above, but is probably a
    > bad idea for similar reasons that Tomas gave to the
    > above.

    I have played with this a bit. It has the downside of doing more lookups
    during editing, but the loading is much faster and it is really easy to
    implement. I'm about to commit a patch that implements this as the
    default until we come up with something better. Feel free to flame me
    with quantitative reasons (i.e. some memory usage statistics before and
    after) why we should not go this route.

    (I'm not doing this because I think it's best, only because it's better
    than what we have now.)
    >
    > > 3. Have a cache of most recently used
    > > Attribute/Properties. We do a
    > > comparison with the properties in the cache and add
    > > them if the
    > > properties are not in the cache. This gives us a
    > > constant penalty with
    > > some trade-off of document size versus speed.
    >
    > I think that this is going to hurt us
    > performance-wise. I don't see how one can make the
    > assumption that one is likely to hit duplicate styles
    > within short spans of time. Even so, we'd still
    > need a method to compare the attributes, and a
    > mechanism for passivating parts of the cache. I
    > haven't done any thorough testing on this, but it
    > seems to have all of the above flaws plus some new
    > ones of its own.
    >

    I think a simple buffer with a couple of heuristics could do rather
    well. We could play with the cache size to see how much effect that has.

    The heuristics would be: every time an Att/Prop does not find a match
    in the cache, it replaces an Att/Prop in the cache and gets a new
    indexAP.

    Att/Props in the cache which get matched have their hit count
    incremented by one.

    The oldest Att/Prop with the lowest hit count is the one replaced by
    the new Att/Prop.

    This would be simple and fast with a constant speed penalty (i.e.
    independent of n).

    > > 4. Mike Nordell's suggestion of string => integer
    > > translation. I was
    > > thinking of the same thing. Doing a MD5 sum might be
    > > overkill but
    > > something like that would guarantee a unique
    > > integer/string key. The
    > > integer keys could be stored in a binary vector
    > > which could be searched
    > > at order log(n) time. If we're clever with the data
    > > structures we might
    > > be able to do insertions into the array at order
    > > log(n) time too.
    >
    > MD5Sums and integer IDs break down quickly in certain
    > cases. They're acceptable if we define the granularity
    > on an appropriate level, and make some other
    > hard-and-fast decisions. Currently, all these things
    > look alike to the PT, and are stored in the same
    > PP_Attribute:
    >
    > "font:Arial;size:12pt"
    > "size:12pt;font:Arial"
    > "font:Arial ;size:12pt"
    >

    Currently we treat all these as unique APs. My suggestion for this
    technique would continue to treat these as different even though, as
    you say, they're semantically the same. My GUESS is that it is not
    worth the penalty to pull out the semantic information and order the
    props accordingly. However, I could be wrong and would be happy to see
    some test data showing otherwise.

    > Do we want all of these treated the same?
    > Semantically, they're all identical. As far as strcmp,
    > MD5, etc... goes, they're all radically different
    > strings.
    >

    Yes.

    > If we want to treat these things as different, then
    > using integer IDs is easy. There's no need for MD5sums
    > - 'i++;' will do quite nicely.

    I don't get this. My proposal would be that every new string gets one
    computation to turn it into a unique integer.

    e.g. strings from the importers appear in this order and get
    translated into unique integers:
    "font:Arial;size:12pt" => 2134563234
    "size:12pt;font:Arial" => 3124563234
    "font:Arial ;size:12pt" => 9234567234
    "font:Times New Roman; size:14pt" => 89345673234
    "font:Arial;size:12pt" => 2134563234

    Then instead of comparing strings we simply compare the keyed
    integers. If the integers match we have identical strings. This
    comparison is faster than string matching, and it has the additional
    advantage of allowing the strings to be ordered for a binary search or
    placed in a hash table.

    > However, if we treat
    > these as different, the memory usage will likely go
    > up. By how much (or little) remains to be seen.
    >
    > Finally, these suggestions, while productive, might
    > not be the right place to solve this problem. We have
    > quadratic behavior in our comparison routine. We must
    > perform 47,000,000 comparisons from something that's
    > only called 27,000 times. Yes, strcmp is a little
    > wasteful. Is the solution to replace strcmp with an
    > integer comparison, or to find a solution that doesn't
    > require calling strcmp (or an equivalent integer
    > comparison) 47 million times?
    >

    The above suggestions would all do this. (Well, suggestion 2 would
    only solve the comparison problem during import.)

    > Discuss amongst yourselves.
    >

    We are :-)

    Cheers

    Martin



    This archive was generated by hypermail 2.1.4 : Thu Sep 25 2003 - 21:27:32 EDT