Light at the end of the tunnel Was Re: 5291 - A Task Force

From: Johnny Lee (typo_pl@hotmail.com)
Date: Fri Sep 26 2003 - 15:39:42 EDT

  • Next message: Dom Lachowicz: "Re: Light at the end of the tunnel Was Re: 5291 - A Task Force"

    I took a look at this problem and have come up with a rough solution.

    The changes need some cleanup though.

    My changes reduce the import time (measured from when I click Open on the
    Open File dialog box until 'Importing Document...' disappears from the
    status bar) from ~120 seconds to ~22 seconds.

    The document spends at keast another 40 seconds building the document
    though.

    Main changes:
    - better checksum (aka hashcode) to help differentiate between AP objects.
    - sort the AP objects so we can use binary search

    Description and patches can be found at
    <http://www.geocities.com/typopl/bug5291.html>.

    J

    >From: Martin Sevior <msevior@seviorpc.ph.unimelb.edu.au>
    >Reply-To: msevior@physics.unimelb.edu.au
    >To: msevior@physics.unimelb.edu.au
    >CC: Dom Lachowicz <domlachowicz@yahoo.com>,AbiWord
    ><abiword-dev@abisource.com>
    >Subject: Re: 5291 - A Task Force
    >Date: Fri, 26 Sep 2003 14:26:54 +1000
    >
    >On Fri, 2003-09-26 at 12:14, Martin Sevior wrote:
    > > On Fri, 2003-09-26 at 02:13, Dom Lachowicz wrote:
    > > > > 1. Just #ifdef out the comparison in addIfUnique so
    > > > > we just add the
    > > > > strings. This provides no penalty for speed at all.
    > > > > The downside is that
    > > > > the attributes/properties grow linearly with
    > > > > document size and every
    > > > > time some presses a key.
    > > >
    > > > I agree with Tomas. This is a bad idea.
    > > >
    > > > > 2. Be selective about addIfUnique. We don't do the
    > > > > ifUnique upon
    > > > > document loading but keep it during editing. This
    > > > > way the document
    > > > > doesn't grow any bigger than needed during editing
    > > > > but we keep the fast
    > > > > load.
    > > >
    > > > This sounds better than the above, but is probably a
    > > > bad idea for similar reasons that Tomas gave to the
    > > > above.
    > >
    > > I have played with this a bit. It has the downside of doing more lookups
    > > during editting but the loading is much faster and is really easy to
    > > implement. I'm about to commit a patch that implements this as the
    > > default until we come up with something better. Feel free to flame we
    > > with quantitative reasons (ie some memory usage statistics before and
    > > after) why we should not go this route.
    > >
    > > (I'm not doing this because I think it's best, only because it's better
    > > than what we have now.)
    >
    >
    >Dom told me I should post statistics first. Then commit if OK.
    >
    >I loaded the RTF-1.7 specificiation (in RTF form). I had CVS abiword
    >compiled with full optimizations.
    >
    >Without the speed up patch, the memory consumption was 72 Megabytes.
    >With this simple speed up patch, the memory consumption was 102
    >Megabytes.
    >
    >This 30% increase in memory usage is significant. I think Dom and I
    >decided to go down the route described in 4 using crc32 checksums to
    >give us unique keys to the strings.
    >
    >Cheers
    >
    >Martin
    > > >
    > > > > 3. Have a cache of most recently used
    > > > > Attribute/Properties. We do a
    > > > > comparison with the properties in the cache and add
    > > > > them if the
    > > > > properties are not in the cache. This gives us a
    > > > > constant penalty with
    > > > > some trade-off of document size versus speed.
    > > >
    > > > I think that this is going to hurt us
    > > > performance-wise. I don't see how one can make the
    > > > assumption that one is likely to hit duplicate styles
    > > > within short spans of time. Even still, we'd still
    > > > need a method to compare the attributes, and a
    > > > mechanism for passivating parts of the cache. I
    > > > haven't done any thorough testing on this, but it
    > > > seems to have all of the above flaws plus some new
    > > > ones of its own.
    > > >
    > >
    > > I think a simple buffer with a couple of heuristics could do rather
    > > well. We could play with the cache size to see how much effect that has.
    > >
    > > The heuristics would be: Every time an Att/Prop that does not find a
    > > match in the cache it replaces a Att/Prop in the cache and gets a new
    > > indexAP.
    > >
    > > Att/Props in the cache which get matched the score count incremented by
    > > one.
    > >
    > > The oldest Att/Prop with the lowest hit count gets discarded by the new
    > > Att/Prop.
    > >
    > > This would be simple and fast with a constant speed penalty (ie
    > > independent on n).
    > >
    > > > > 4. Mike Nordell's suggestion of string => integer
    > > > > translation. I was
    > > > > thinking of the same thing. Doing a MD5 sum might be
    > > > > overkill but
    > > > > something like that would guarantee a unique
    > > > > integer/string key. The
    > > > > integer keys could be stored in a binary vector
    > > > > which could be searched
    > > > > at order log(n) time. If we're clever with the data
    > > > > structures we might
    > > > > be able to do insertions into the array at order
    > > > > log(n) time too.
    > > >
    > > > MD5Sums and integer IDs break down quickly in certain
    > > > cases. They're acceptable if we define the granularity
    > > > on an appropriate level, and make some other
    > > > hard-and-set decisions. Currently, all these things
    > > > look alike to the PT, and are stored in the same
    > > > PP_Attribute:
    > > >
    > > > "font:Arial;size:12pt"
    > > > "size:12pt;font:Arial"
    > > > "font:Arial ;size:12pt"
    > > >
    > >
    > > Currently we treat all these as unique AP's. My suggestion for this
    > > technique would continue to treat these are different even though as you
    > > say, they're sematically the same. My GUESS is that it is not worth the
    > > penalty to pull out the semantic information and order the props
    > > accordingly. However I could be wrong and would be happy to see some
    > > test data to show otherwise.
    > >
    > > > Do we want all of these treated the same?
    > > > Semantically, they're all identical. As far as strcmp,
    > > > MD5, etc... goes, they're all radically different
    > > > strings.
    > > >
    > >
    > > Yes.
    > >
    > > > If we want to treat these things as different, then
    > > > using integer IDs is easy. There's no need for MD5sums
    > > > - 'i++;' will do quite nicely.
    > >
    > > I don't get this. My proposal would be that every new new string gets a
    > > one computation to turn it into a unique integer.
    > >
    > > eg strings from the importers appear in this order and get translated
    > > into a unique integer
    > > "font:Arial;size:12pt" => 2134563234
    > > "size:12pt;font:Arial" => 3124563234
    > > "font:Arial ;size:12pt" => 9234567234
    > > "find:Times New Roman; size:14pt" => 89345673234
    > > "font:Arial;size:12pt" => 2134563234
    > >
    > > Then instead of comparing strings we simply compare the keyed-integer's.
    > > If the integers match we have identical strings. This comparison is
    > > faster than string matching and has the additional advantage of allowing
    > > the strings to be ordered allowing a binary comparison or placed in a
    > > hash table.
    > >
    > > > However, if we treat
    > > > these as different, the memory usage will likely go
    > > > up. By how much (or little) remains to be seen.
    > > >
    > > > Finally, these suggestions, while productive, might
    > > > not be the right place to solve this problem. We have
    > > > quadratic behavior in our comparison routine. We must
    > > > perform 47,000,000 comparisons from something that's
    > > > only called 27,000 times. Yes, strcmp is a little
    > > > wasteful. Is the solution to replace strcmp with an
    > > > integer comparison, or to find a solution that doesn't
    > > > require calling strcmp (or an equivalent integer
    > > > comparison) 47 million times?
    > > >
    > >
    > > The above suggestions would all do this. (Well suggestion 2 would only
    > > solve the comparison problem during import.)
    > >
    > > > Discuss amongst yourselves.
    > > >
    > >
    > > We are :-)
    > >
    > > Cheers
    > >
    > > Martin
    > >
    > >
    >

    _________________________________________________________________
    Instant message in style with MSN Messenger 6.0. Download it now FREE!
    http://msnmessenger-download.com



    This archive was generated by hypermail 2.1.4 : Fri Sep 26 2003 - 15:54:42 EDT