Re: 5291 - A Task Force

From: Martin Sevior (msevior@seviorpc.ph.unimelb.edu.au)
Date: Fri Sep 26 2003 - 00:26:54 EDT

  • Next message: Martin Sevior: "commit: crc32 class."

    On Fri, 2003-09-26 at 12:14, Martin Sevior wrote:
    > On Fri, 2003-09-26 at 02:13, Dom Lachowicz wrote:
    > > > 1. Just #ifdef out the comparison in addIfUnique so
    > > > we just add the
    > > > strings. This provides no penalty for speed at all.
    > > > The downside is that
    > > > the attributes/properties grow linearly with
    > > > document size and every
    > > > time some presses a key.
    > >
    > > I agree with Tomas. This is a bad idea.
    > >
    > > > 2. Be selective about addIfUnique. We don't do the
    > > > ifUnique upon
    > > > document loading but keep it during editing. This
    > > > way the document
    > > > doesn't grow any bigger than needed during editing
    > > > but we keep the fast
    > > > load.
    > >
    > > This sounds better than the above, but is probably a
    > > bad idea for similar reasons that Tomas gave to the
    > > above.
    >
    > I have played with this a bit. It has the downside of doing more lookups
    > during editting but the loading is much faster and is really easy to
    > implement. I'm about to commit a patch that implements this as the
    > default until we come up with something better. Feel free to flame we
    > with quantitative reasons (ie some memory usage statistics before and
    > after) why we should not go this route.
    >
    > (I'm not doing this because I think it's best, only because it's better
    > than what we have now.)

    Dom told me I should post statistics first. Then commit if OK.

    I loaded the RTF-1.7 specificiation (in RTF form). I had CVS abiword
    compiled with full optimizations.

    Without the speed up patch, the memory consumption was 72 Megabytes.
    With this simple speed up patch, the memory consumption was 102
    Megabytes.

    This 30% increase in memory usage is significant. I think Dom and I
    decided to go down the route described in 4 using crc32 checksums to
    give us unique keys to the strings.

    Cheers

    Martin
    > >
    > > > 3. Have a cache of most recently used
    > > > Attribute/Properties. We do a
    > > > comparison with the properties in the cache and add
    > > > them if the
    > > > properties are not in the cache. This gives us a
    > > > constant penalty with
    > > > some trade-off of document size versus speed.
    > >
    > > I think that this is going to hurt us
    > > performance-wise. I don't see how one can make the
    > > assumption that one is likely to hit duplicate styles
    > > within short spans of time. Even still, we'd still
    > > need a method to compare the attributes, and a
    > > mechanism for passivating parts of the cache. I
    > > haven't done any thorough testing on this, but it
    > > seems to have all of the above flaws plus some new
    > > ones of its own.
    > >
    >
    > I think a simple buffer with a couple of heuristics could do rather
    > well. We could play with the cache size to see how much effect that has.
    >
    > The heuristics would be: Every time an Att/Prop that does not find a
    > match in the cache it replaces a Att/Prop in the cache and gets a new
    > indexAP.
    >
    > Att/Props in the cache which get matched the score count incremented by
    > one.
    >
    > The oldest Att/Prop with the lowest hit count gets discarded by the new
    > Att/Prop.
    >
    > This would be simple and fast with a constant speed penalty (ie
    > independent on n).
    >
    > > > 4. Mike Nordell's suggestion of string => integer
    > > > translation. I was
    > > > thinking of the same thing. Doing a MD5 sum might be
    > > > overkill but
    > > > something like that would guarantee a unique
    > > > integer/string key. The
    > > > integer keys could be stored in a binary vector
    > > > which could be searched
    > > > at order log(n) time. If we're clever with the data
    > > > structures we might
    > > > be able to do insertions into the array at order
    > > > log(n) time too.
    > >
    > > MD5Sums and integer IDs break down quickly in certain
    > > cases. They're acceptable if we define the granularity
    > > on an appropriate level, and make some other
    > > hard-and-set decisions. Currently, all these things
    > > look alike to the PT, and are stored in the same
    > > PP_Attribute:
    > >
    > > "font:Arial;size:12pt"
    > > "size:12pt;font:Arial"
    > > "font:Arial ;size:12pt"
    > >
    >
    > Currently we treat all these as unique AP's. My suggestion for this
    > technique would continue to treat these are different even though as you
    > say, they're sematically the same. My GUESS is that it is not worth the
    > penalty to pull out the semantic information and order the props
    > accordingly. However I could be wrong and would be happy to see some
    > test data to show otherwise.
    >
    > > Do we want all of these treated the same?
    > > Semantically, they're all identical. As far as strcmp,
    > > MD5, etc... goes, they're all radically different
    > > strings.
    > >
    >
    > Yes.
    >
    > > If we want to treat these things as different, then
    > > using integer IDs is easy. There's no need for MD5sums
    > > - 'i++;' will do quite nicely.
    >
    > I don't get this. My proposal would be that every new new string gets a
    > one computation to turn it into a unique integer.
    >
    > eg strings from the importers appear in this order and get translated
    > into a unique integer
    > "font:Arial;size:12pt" => 2134563234
    > "size:12pt;font:Arial" => 3124563234
    > "font:Arial ;size:12pt" => 9234567234
    > "find:Times New Roman; size:14pt" => 89345673234
    > "font:Arial;size:12pt" => 2134563234
    >
    > Then instead of comparing strings we simply compare the keyed-integer's.
    > If the integers match we have identical strings. This comparison is
    > faster than string matching and has the additional advantage of allowing
    > the strings to be ordered allowing a binary comparison or placed in a
    > hash table.
    >
    > > However, if we treat
    > > these as different, the memory usage will likely go
    > > up. By how much (or little) remains to be seen.
    > >
    > > Finally, these suggestions, while productive, might
    > > not be the right place to solve this problem. We have
    > > quadratic behavior in our comparison routine. We must
    > > perform 47,000,000 comparisons from something that's
    > > only called 27,000 times. Yes, strcmp is a little
    > > wasteful. Is the solution to replace strcmp with an
    > > integer comparison, or to find a solution that doesn't
    > > require calling strcmp (or an equivalent integer
    > > comparison) 47 million times?
    > >
    >
    > The above suggestions would all do this. (Well suggestion 2 would only
    > solve the comparison problem during import.)
    >
    > > Discuss amongst yourselves.
    > >
    >
    > We are :-)
    >
    > Cheers
    >
    > Martin
    >
    >



    This archive was generated by hypermail 2.1.4 : Thu Sep 25 2003 - 23:40:01 EDT