Re: Light at the end of the tunnel Was Re: 5291 - A Task Force

From: Dom Lachowicz (domlachowicz@yahoo.com)
Date: Fri Sep 26 2003 - 15:52:11 EDT


    This is almost exactly what Martin and I were planning
    to do (time permitting).

    I do have some questions that you may or may not be
    able to answer:

    1) Before vs. After memory usage
    2) What're the mathematical chances for the GLib
    string hashing algorithm to produce a duplicate hash
    given different data? I'm guessing that it's much
    higher than a CRC32 or MD5/SHA1 sum, given that
    hashtables often dynamically resize or chain buckets
    together when they encounter a collision. Note that
    fixing this just means using a better checksum
    generator.
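    For a rough sense of question 2: with an *ideal* 32-bit hash (GLib's string
    hash and CRC32 are both 32-bit keys), collision odds are governed by the
    birthday bound, so a duplicate becomes more likely than not around ~77,000
    distinct strings; only a wider digest like MD5/SHA1 changes that picture.
    A small illustrative sketch (hypothetical helper, not GLib code):

```python
import math

def collision_probability(n_strings, hash_bits):
    """Birthday-bound estimate of the chance of at least one collision
    among n_strings values drawn uniformly from a hash_bits-wide space."""
    space = 2.0 ** hash_bits
    # P(collision) ~= 1 - exp(-n^2 / (2 * space))
    return 1.0 - math.exp(-(n_strings ** 2) / (2.0 * space))

# A 32-bit key (g_str_hash, CRC32) vs a 128-bit digest (MD5)
# for 100,000 distinct attribute/property strings:
p32 = collision_probability(100_000, 32)    # well over 50%
p128 = collision_probability(100_000, 128)  # vanishingly small
```

    By this estimate the practical difference between g_str_hash and CRC32 is
    distribution quality on real-world inputs, not the size of the key space.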

    Other than that, this looks like a superb idea. It
    gets rid of the O(n^2) behavior AND removes the bulk
    of the strcmps. Well done.

    Dom

    --- Johnny Lee <typo_pl@hotmail.com> wrote:
    > I took a look at this problem and have come up with
    > a rough solution.
    >
    > The changes need some cleanup though.
    >
    > My changes reduce the import time (measured from
    > when I click Open on the
    > Open File dialog box until 'Importing Document...'
    > disappears from the
    > status bar) from ~120 seconds to ~22 seconds.
    >
    > The document spends at least another 40 seconds
    > building the document
    > though.
    >
    > Main changes:
    > - better checksum (aka hashcode) to help
    > differentiate between AP objects.
    > - sort the AP objects so we can use binary search
    >
    > Description and patches can be found at
    > <http://www.geocities.com/typopl/bug5291.html>.
    >
    > J
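    For readers skimming the patch description, the combination Johnny
    describes (a better checksum plus a sorted vector searched with binary
    search) might look roughly like this sketch; all names here are
    illustrative, not AbiWord's actual code:

```python
import zlib
from bisect import bisect_left

class APTable:
    """Sketch of the bug-5291 approach: keep (checksum, string) pairs
    sorted so addIfUnique is a binary search instead of a linear scan
    of strcmps. Hypothetical names, not the real piece-table types."""

    def __init__(self):
        self._entries = []  # sorted list of (crc32, ap_string) pairs

    def add_if_unique(self, ap_string):
        """Return True if ap_string was new and got added."""
        key = (zlib.crc32(ap_string.encode("utf-8")), ap_string)
        i = bisect_left(self._entries, key)
        if i < len(self._entries) and self._entries[i] == key:
            return False  # duplicate: checksum AND string both match
        self._entries.insert(i, key)
        return True
```

    Keeping the string in the sort key means a checksum collision between two
    different strings still resolves correctly via the string comparison.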
    >
    >
    > >From: Martin Sevior
    > <msevior@seviorpc.ph.unimelb.edu.au>
    > >Reply-To: msevior@physics.unimelb.edu.au
    > >To: msevior@physics.unimelb.edu.au
    > >CC: Dom Lachowicz <domlachowicz@yahoo.com>,AbiWord
    > ><abiword-dev@abisource.com>
    > >Subject: Re: 5291 - A Task Force
    > >Date: Fri, 26 Sep 2003 14:26:54 +1000
    > >
    > >On Fri, 2003-09-26 at 12:14, Martin Sevior wrote:
    > > > On Fri, 2003-09-26 at 02:13, Dom Lachowicz
    > wrote:
    > > > > > 1. Just #ifdef out the comparison in
    > addIfUnique so
    > > > > > we just add the
    > > > > > strings. This provides no penalty for speed
    > at all.
    > > > > > The downside is that
    > > > > > the attributes/properties grow linearly with
    > > > > > document size and every
    > > > > > time someone presses a key.
    > > > >
    > > > > I agree with Tomas. This is a bad idea.
    > > > >
    > > > > > 2. Be selective about addIfUnique. We don't
    > do the
    > > > > > ifUnique upon
    > > > > > document loading but keep it during editing.
    > This
    > > > > > way the document
    > > > > > doesn't grow any bigger than needed during
    > editing
    > > > > > but we keep the fast
    > > > > > load.
    > > > >
    > > > > This sounds better than the above, but is
    > probably a
    > > > > bad idea for similar reasons that Tomas gave
    > to the
    > > > > above.
    > > >
    > > > I have played with this a bit. It has the
    > downside of doing more lookups
    > > > during editing but the loading is much faster
    > and is really easy to
    > > > implement. I'm about to commit a patch that
    > implements this as the
    > > > default until we come up with something better.
    > Feel free to flame me
    > > > with quantitative reasons (ie some memory usage
    > statistics before and
    > > > after) why we should not go this route.
    > > >
    > > > (I'm not doing this because I think it's best,
    > only because it's better
    > > > than what we have now.)
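    The shape of option 2 as committed is easy to picture; a minimal sketch,
    assuming a simple loading flag (names hypothetical, not AbiWord's):

```python
class PieceTableStrings:
    """Sketch of option 2: skip the uniqueness scan while a document is
    loading, keep it during interactive editing. The table can grow with
    duplicates during load, but loading avoids the O(n^2) scans."""

    def __init__(self):
        self.strings = []
        self.loading = False  # set True for the duration of an import

    def add_if_unique(self, s):
        if not self.loading:
            # editing path: O(n) scan keeps the table deduplicated
            for i, existing in enumerate(self.strings):
                if existing == s:
                    return i
        # loading path (or no match found): append unconditionally
        self.strings.append(s)
        return len(self.strings) - 1
```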
    > >
    > >
    > >Dom told me I should post statistics first. Then
    > commit if OK.
    > >
    > >I loaded the RTF-1.7 specification (in RTF form).
    > I had CVS abiword
    > >compiled with full optimizations.
    > >
    > >Without the speed up patch, the memory consumption
    > was 72 Megabytes.
    > >With this simple speed up patch, the memory
    > consumption was 102
    > >Megabytes.
    > >
    > >This 30 MB (~40%) increase in memory usage is
    > think Dom and I
    > >decided to go down the route described in 4 using
    > crc32 checksums to
    > >give us unique keys to the strings.
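    One caveat worth keeping in mind with that route: a 32-bit CRC narrows the
    candidates sharply but cannot strictly guarantee a unique key per string,
    so equal checksums still want a final string comparison before declaring a
    match. A minimal sketch of the key function (hypothetical name):

```python
import zlib

def ap_key(ap_string):
    """CRC32 key for an attribute/property string (route 4).
    Equal strings always map to equal keys; unequal strings *almost*
    always map to unequal keys, so a matching key should be confirmed
    with one last string comparison."""
    return zlib.crc32(ap_string.encode("utf-8"))
```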
    > >
    > >Cheers
    > >
    > >Martin
    > > > >
    > > > > > 3. Have a cache of most recently used
    > > > > > Attribute/Properties. We do a
    > > > > > comparison with the properties in the cache
    > and add
    > > > > > them if the
    > > > > > properties are not in the cache. This gives
    > us a
    > > > > > constant penalty with
    > > > > > some trade-off of document size versus
    > speed.
    > > > >
    > > > > I think that this is going to hurt us
    > > > > performance-wise. I don't see how one can make
    > the
    > > > > assumption that one is likely to hit duplicate
    > styles
    > > > > within short spans of time. Even so, we'd still
    > > > > need a method to compare the attributes, and a
    > > > > mechanism for passivating parts of the cache.
    > I
    > > > > haven't done any thorough testing on this, but
    > it
    > > > > seems to have all of the above flaws plus some
    > new
    > > > > ones of its own.
    > > > >
    > > >
    > > > I think a simple buffer with a couple of
    > heuristics could do rather
    > > > well. We could play with the cache size to see
    > how much effect that has.
    > > >
    > > > The heuristics would be: every time an Att/Prop
    > > > does not find a match in the cache, it replaces an
    > > > Att/Prop in the cache and gets a new indexAP.
    > > >
    > > > Att/Props in the cache which get matched have
    > > > their score count incremented by one.
    > > >
    > > > The oldest Att/Prop with the lowest hit count
    > > > gets replaced by the new Att/Prop.
    > > >
    > > > This would be simple and fast with a constant
    > speed penalty (ie
    > > > independent of n).
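    Martin's three heuristics fit in a few lines; a sketch, with all names
    hypothetical and the eviction rule taken directly from the description
    (oldest entry among those with the lowest hit count goes first):

```python
class APCache:
    """Sketch of option 3: a fixed-size cache of recent Att/Props.
    A hit bumps the entry's score; a miss evicts the oldest entry
    with the lowest hit count. Work per lookup is bounded by the
    cache size, i.e. constant and independent of document size."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.entries = []  # each entry: [ap_string, hits, insert_order]
        self._order = 0

    def lookup_or_add(self, ap):
        for entry in self.entries:
            if entry[0] == ap:
                entry[1] += 1   # matched: increment its score
                return True     # cache hit
        self._order += 1
        new = [ap, 0, self._order]
        if len(self.entries) < self.capacity:
            self.entries.append(new)
        else:
            # evict the oldest among the lowest-hit entries
            victim = min(self.entries, key=lambda e: (e[1], e[2]))
            self.entries[self.entries.index(victim)] = new
        return False            # cache miss
```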
    > > >
    > > > > > 4. Mike Nordell's suggestion of string =>
    > integer
    > > > > > translation. I was
    > > > > > thinking of the same thing. Doing a MD5 sum
    > might be
    > > > > > overkill but
    > > > > > something like that would guarantee a unique
    > > > > > integer/string key. The
    > > > > > integer keys could be stored in a binary
    > vector
    > > > > > which could be searched
    > > > > > at order log(n) time. If we're clever with
    > the data
    > > > > > structures we might
    > > > > > be able to do insertions into the array at
    > order
    > > > > > log(n) time too.
    > > > >
    > > > > MD5Sums and integer IDs break down quickly in
    > certain
    > > > > cases. They're acceptable if we define the
    > granularity
    > > > > on an appropriate level, and make some other
    > > hard-and-fast decisions. Currently, all these
    > things
    > > > > look alike to the PT, and are stored in the
    > same
    > > > > PP_Attribute:
    > > > >
    > > > > "font:Arial;size:12pt"
    > > > > "size:12pt;font:Arial"
    > > > > "font:Arial ;size:12pt"
    > > > >
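    For what it's worth, collapsing those three variants would only take a
    small canonicalization pass (parse, trim whitespace, sort keys); whether
    that pass pays for itself is exactly the open question below. A
    hypothetical sketch, not anything in the AbiWord tree:

```python
def canonicalize_props(props):
    """Normalize a property string so semantically equal variants
    compare equal: split on ';', trim each key and value, sort by key."""
    pairs = []
    for item in props.split(";"):
        if not item.strip():
            continue
        key, _, value = item.partition(":")
        pairs.append((key.strip(), value.strip()))
    return ";".join(f"{k}:{v}" for k, v in sorted(pairs))
```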
    > > >
    > > > Currently we treat all these as unique AP's. My
    > suggestion for this
    > technique would continue to treat these as
    > different even though, as you
    > say, they're semantically the same. My GUESS is
    > that it is not worth the
    > > > penalty to pull out the semantic information and
    > order the props
    > > > accordingly. However I could be wrong and would
    > be
    === message truncated ===




    This archive was generated by hypermail 2.1.4 : Fri Sep 26 2003 - 16:07:28 EDT