Re: Light at the end of the tunnel Was Re: 5291 - A Task Force

From: Johnny Lee (typo_pl@hotmail.com)
Date: Fri Sep 26 2003 - 16:32:06 EDT


    Comments inline below.

    >From: Dom Lachowicz <domlachowicz@yahoo.com>
    >To: AbiWord <abiword-dev@abisource.com>
    >Subject: Re: Light at the end of the tunnel Was Re: 5291 - A Task Force
    >Date: Fri, 26 Sep 2003 12:52:11 -0700 (PDT)
    >
    >This is almost exactly what Martin and I were planning
    >to do (time permitting).
    >
    >I do have some questions that you may or may not be
    >able to answer:
    >
    >1) Before vs. After memory usage

    Besides the increase due to actual code changes:

    - There's another vector in the pp_TableAttrProp class. This sorted vector
    is the same size as the main vector in pp_TableAttrProp: 4n + 20 bytes,
    where n is the number of elements in the vector.

    - The pp_AttrProp class increases by 4 bytes (the m_index field). This
    should probably be moved into the sorted vector table in pp_TableAttrProp;
    I wanted to get the changes out before the weekend.

    So my estimate is that memory usage grows by roughly 8n + 20 extra bytes.

    The RTF doc had < 10,000 APs, I think, so I'd guess < 100K of overhead for
    the RTF doc.

    When I checked the memory usage of the old and new AbiWord executables in
    the Windows Task Manager, the old one was higher by a couple of MB,
    probably because it made heavier use of the memory heap.
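    As a sanity check on the arithmetic above (a sketch, not AbiWord code;
    estimateExtraBytes is a made-up name):

```cpp
#include <cstddef>

// Sketch of the estimate above: the extra sorted vector costs 4n + 20
// bytes and the new m_index field costs 4 bytes per AP, so the total
// overhead is roughly 8n + 20 bytes for n attribute/property sets.
static std::size_t estimateExtraBytes(std::size_t n)
{
    const std::size_t sortedVector = 4 * n + 20; // mirror of the main vector
    const std::size_t indexFields  = 4 * n;      // one m_index per pp_AttrProp
    return sortedVector + indexFields;           // = 8n + 20
}
```

    For the ~10,000 APs in the RTF doc that works out to 8 * 10,000 + 20 =
    80,020 bytes, i.e. under 100K.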

    >2) What're the mathematical chances for the Glib
    >string hashing algorithm to produce a duplicate hash
    >given different data? I'm guessing that it's much
    >higher than a CRC32 or MD5/SHA1 sum, given that
    >hashtables often dynamically resize or chain buckets
    >together when they encounter a collision. Note that
    >fixing this just means using a better checksum
    >generator.

    I don't know how often you'd get duplicate hash values.
    The code doesn't even hash every byte of data: it only looks at up to
    8 bytes of each attribute, property, and value string.

    When it finds an AP with a matching hash value, it uses that index as the
    starting point for a linear search, stopping when an exact match is found
    or the AP being checked has a different hash value.

    As long as there aren't too many duplicates, then the code shouldn't
    degenerate into the old case.

    I didn't measure how many hash duplicates there were in the RTF doc, but
    if you look at the number of calls to PP_AttrProp::isExactMatch from
    pp_TableAttrProp::findMatch in the profile results for the new AbiWord
    (52284 vs. 27492), there aren't many duplicates for the APs that AbiWord
    was looking for, i.e. fewer than 2 on average.
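    The lookup scheme described above can be sketched roughly as follows.
    This is a minimal illustration, not the actual pp_TableAttrProp code:
    AttrProp, checksum8, and findMatch are hypothetical stand-ins, and the
    real patch checksums the attribute, property, and value strings
    separately rather than one flattened string.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a pp_AttrProp entry.
struct AttrProp {
    std::string text;    // flattened attribs/props/values
    std::uint32_t hash;  // checksum over (at most) the first 8 bytes
};

// Like the patch described above, only look at up to 8 bytes per string.
static std::uint32_t checksum8(const std::string& s)
{
    std::uint32_t h = 0;
    for (std::size_t i = 0; i < s.size() && i < 8; ++i)
        h = h * 31 + static_cast<unsigned char>(s[i]);
    return h;
}

// Entries are kept sorted by hash so we can binary-search instead of
// linearly scanning every AP (the old O(n^2) import behavior).
static int findMatch(const std::vector<AttrProp>& sorted, const std::string& text)
{
    const std::uint32_t h = checksum8(text);
    auto it = std::lower_bound(sorted.begin(), sorted.end(), h,
        [](const AttrProp& ap, std::uint32_t key) { return ap.hash < key; });
    // Linear scan through the (hopefully short) run of equal hashes,
    // stopping at an exact match or when the hash value changes.
    for (; it != sorted.end() && it->hash == h; ++it)
        if (it->text == text)  // stand-in for PP_AttrProp::isExactMatch
            return static_cast<int>(it - sorted.begin());
    return -1; // no match; the caller would insert a new AP here
}
```

    As long as hash duplicates stay rare, the scan after the binary search
    touches only a couple of entries, which matches the isExactMatch call
    counts in the profile.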

    J

    >
    >Other than that, this looks like a superb idea. It
    >gets rid of the O(n^2) behavior AND removes the bulk
    >of the strcmps. Well done.
    >
    >Dom
    >
    >--- Johnny Lee <typo_pl@hotmail.com> wrote:
    > > I took a look at this problem and have come up with
    > > a rough solution.
    > >
    > > The changes need some cleanup though.
    > >
    > > My changes reduce the import time (measured from
    > > when I click Open on the
    > > Open File dialog box until 'Importing Document...'
    > > disappears from the
    > > status bar) from ~120 seconds to ~22 seconds.
    > >
    > > The document spends at least another 40 seconds
    > > building the document
    > > though.
    > >
    > > Main changes:
    > > - better checksum (aka hashcode) to help
    > > differentiate between AP objects.
    > > - sort the AP objects so we can use binary search
    > >
    > > Description and patches can be found at
    > > <http://www.geocities.com/typopl/bug5291.html>.
    > >
    > > J
    > >
    > >
    > > >From: Martin Sevior
    > > <msevior@seviorpc.ph.unimelb.edu.au>
    > > >Reply-To: msevior@physics.unimelb.edu.au
    > > >To: msevior@physics.unimelb.edu.au
    > > >CC: Dom Lachowicz <domlachowicz@yahoo.com>,AbiWord
    > > ><abiword-dev@abisource.com>
    > > >Subject: Re: 5291 - A Task Force
    > > >Date: Fri, 26 Sep 2003 14:26:54 +1000
    > > >
    > > >On Fri, 2003-09-26 at 12:14, Martin Sevior wrote:
    > > > > On Fri, 2003-09-26 at 02:13, Dom Lachowicz
    > > wrote:
    > > > > > > 1. Just #ifdef out the comparison in
    > > addIfUnique so
    > > > > > > we just add the
    > > > > > > strings. This provides no penalty for speed
    > > at all.
    > > > > > > The downside is that
    > > > > > > the attributes/properties grow linearly with
    > > > > > > document size and every
    > > > > > > time someone presses a key.
    > > > > >
    > > > > > I agree with Tomas. This is a bad idea.
    > > > > >
    > > > > > > 2. Be selective about addIfUnique. We don't
    > > do the
    > > > > > > ifUnique upon
    > > > > > > document loading but keep it during editing.
    > > This
    > > > > > > way the document
    > > > > > > doesn't grow any bigger than needed during
    > > editing
    > > > > > > but we keep the fast
    > > > > > > load.
    > > > > >
    > > > > > This sounds better than the above, but is
    > > probably a
    > > > > > bad idea for similar reasons that Tomas gave
    > > to the
    > > > > > above.
    > > > >
    > > > > I have played with this a bit. It has the
    > > downside of doing more lookups
    > > > > during editing but the loading is much faster
    > > and is really easy to
    > > > > implement. I'm about to commit a patch that
    > > implements this as the
    > > > > default until we come up with something better.
    > > Feel free to flame me
    > > > > with quantitative reasons (i.e. some memory usage
    > > statistics before and
    > > > > after) why we should not go this route.
    > > > >
    > > > > (I'm not doing this because I think it's best,
    > > only because it's better
    > > > > than what we have now.)
    > > >
    > > >
    > > >Dom told me I should post statistics first. Then
    > > commit if OK.
    > > >
    > > >I loaded the RTF-1.7 specification (in RTF form).
    > > I had CVS abiword
    > > >compiled with full optimizations.
    > > >
    > > >Without the speed up patch, the memory consumption
    > > was 72 Megabytes.
    > > >With this simple speed up patch, the memory
    > > consumption was 102
    > > >Megabytes.
    > > >
    > > >This 30 MB (~40%) increase in memory usage is significant. I
    > > think Dom and I
    > > >decided to go down the route described in option 4, using
    > > crc32 checksums to
    > > >give us unique keys to the strings.
    > > >
    > > >Cheers
    > > >
    > > >Martin
    > > > > >
    > > > > > > 3. Have a cache of most recently used
    > > > > > > Attribute/Properties. We do a
    > > > > > > comparison with the properties in the cache
    > > and add
    > > > > > > them if the
    > > > > > > properties are not in the cache. This gives
    > > us a
    > > > > > > constant penalty with
    > > > > > > some trade-off of document size versus
    > > speed.
    > > > > >
    > > > > > I think that this is going to hurt us
    > > > > > performance-wise. I don't see how one can make
    > > the
    > > > > > assumption that one is likely to hit duplicate
    > > styles
    > > > > > within short spans of time. Even still, we'd
    > > still
    > > > > > need a method to compare the attributes, and a
    > > > > > mechanism for passivating parts of the cache.
    > > I
    > > > > > haven't done any thorough testing on this, but
    > > it
    > > > > > seems to have all of the above flaws plus some
    > > new
    > > > > > ones of its own.
    > > > > >
    > > > >
    > > > > I think a simple buffer with a couple of
    > > heuristics could do rather
    > > > > well. We could play with the cache size to see
    > > how much effect that has.
    > > > >
    > > > > The heuristics would be: Every time an Att/Prop
    > > does not find a
    > > > > match in the cache, it replaces an Att/Prop in the
    > > cache and gets a new
    > > > > indexAP.
    > > > >
    > > > > Att/Props in the cache that get matched have their
    > > score count incremented by
    > > > > one.
    > > > >
    > > > > The oldest Att/Prop with the lowest hit count
    > > gets discarded in favor of the new
    > > > > Att/Prop.
    > > > >
    > > > > This would be simple and fast with a constant
    > > speed penalty (i.e.
    > > > > independent of n).
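    Martin's cache heuristic could be sketched like this. This is a
    hypothetical illustration, not AbiWord code: APCache, lookupOrInsert,
    and the tie-breaking rule (lowest score, then oldest) are my own
    guesses at the details he leaves open.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One slot in the fixed-size Att/Prop cache described above.
struct CacheEntry {
    std::string props;  // flattened Att/Prop string
    int indexAP;        // index assigned when first inserted
    unsigned score;     // incremented on every match
    unsigned age;       // insertion order, for tie-breaking
};

class APCache {
public:
    explicit APCache(std::size_t capacity) : m_cap(capacity) {}

    // A hit bumps the entry's score and returns its indexAP; a miss
    // inserts the new Att/Prop (evicting the lowest-score, oldest entry
    // when full) and returns a freshly assigned index. The cost is
    // bounded by the cache size, i.e. constant in document size n.
    int lookupOrInsert(const std::string& props)
    {
        for (auto& e : m_entries)
            if (e.props == props) { ++e.score; return e.indexAP; }

        CacheEntry fresh{props, m_nextIndex++, 0, m_clock++};
        if (m_entries.size() < m_cap) {
            m_entries.push_back(fresh);
        } else {
            std::size_t victim = 0; // lowest score, then oldest
            for (std::size_t i = 1; i < m_entries.size(); ++i)
                if (m_entries[i].score < m_entries[victim].score ||
                    (m_entries[i].score == m_entries[victim].score &&
                     m_entries[i].age < m_entries[victim].age))
                    victim = i;
            m_entries[victim] = fresh;
        }
        return fresh.indexAP;
    }

private:
    std::vector<CacheEntry> m_entries;
    std::size_t m_cap;
    int m_nextIndex = 0;
    unsigned m_clock = 0;
};
```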
    > > > >
    > > > > > > 4. Mike Nordell's suggestion of string =>
    > > integer
    > > > > > > translation. I was
    > > > > > > thinking of the same thing. Doing a MD5 sum
    > > might be
    > > > > > > overkill but
    > > > > > > something like that would guarantee a unique
    > > > > > > integer/string key. The
    > > > > > > integer keys could be stored in a binary
    > > vector
    > > > > > > which could be searched
    > > > > > > at order log(n) time. If we're clever with
    > > the data
    > > > > > > structures we might
    > > > > > > be able to do insertions into the array at
    > > order
    > > > > > > log(n) time too.
    > > > > >
    > > > > > MD5Sums and integer IDs break down quickly in
    > > certain
    > > > > > cases. They're acceptable if we define the
    > > granularity
    > > > > > on an appropriate level, and make some other
    > > > > > hard-and-fast decisions. Currently, all these
    > > things
    > > > > > look alike to the PT, and are stored in the
    > > same
    > > > > > PP_Attribute:
    > > > > >
    > > > > > "font:Arial;size:12pt"
    > > > > > "size:12pt;font:Arial"
    > > > > > "font:Arial ;size:12pt"
    > > > > >
    > > > >
    > > > > Currently we treat all these as unique AP's. My
    > > suggestion for this
    > > > > technique would continue to treat these as
    > > different even though, as you
    > > > > say, they're semantically the same. My GUESS is
    > > that it is not worth the
    > > > > penalty to pull out the semantic information and
    > > order the props
    > > > > accordingly. However I could be wrong and would
    > > be
    >=== message truncated ===
    >
    >
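    For what it's worth, the semantic-duplicate problem Dom illustrates with
    "font:Arial;size:12pt" vs. "size:12pt;font:Arial" could be handled by a
    normalization pass along these lines. A hypothetical sketch only:
    canonicalize and trim are made-up names, and Martin's guess above is
    that this extra pass may not be worth the cost.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

// Trim leading/trailing spaces and tabs from one token.
static std::string trim(const std::string& s)
{
    const auto b = s.find_first_not_of(" \t");
    if (b == std::string::npos) return "";
    const auto e = s.find_last_not_of(" \t");
    return s.substr(b, e - b + 1);
}

// Split "key:value;key:value" pairs, trim each side, and sort by key so
// that semantically equal property strings compare byte-for-byte equal.
static std::string canonicalize(const std::string& props)
{
    std::vector<std::string> pairs;
    std::istringstream in(props);
    std::string pair;
    while (std::getline(in, pair, ';')) {
        const auto colon = pair.find(':');
        if (colon == std::string::npos) continue;
        pairs.push_back(trim(pair.substr(0, colon)) + ':' +
                        trim(pair.substr(colon + 1)));
    }
    std::sort(pairs.begin(), pairs.end());
    std::string out;
    for (std::size_t i = 0; i < pairs.size(); ++i)
        out += (i ? ";" : "") + pairs[i];
    return out;
}
```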




    This archive was generated by hypermail 2.1.4 : Fri Sep 26 2003 - 16:47:19 EDT