commit: (HEAD) Re: Light at the end of the tunnel Was Re: 5291 - A Task Force

From: msevior@physics.unimelb.edu.au
Date: Sat Sep 27 2003 - 02:16:49 EDT

  • Next message: Sharuzzaman Ahmat Raslan: "Malay (ms_MY) not in set language"

    Great Work. Thanks!

    Feel free to clean up bits. (I did some already.)

    CVS: ----------------------------------------------------------------------
    CVS: Enter Log. Lines beginning with `CVS:' are removed automatically
    CVS:
    CVS: Committing in .
    CVS:
    CVS: Modified Files:
    CVS: CREDITS.TXT src/text/ptbl/xp/pp_AttrProp.cpp
    CVS: src/text/ptbl/xp/pp_AttrProp.h
    CVS: src/text/ptbl/xp/pp_TableAttrProp.cpp
    CVS: src/text/ptbl/xp/pp_TableAttrProp.h
    CVS: src/text/ptbl/xp/pt_VarSet.cpp
    CVS: ----------------------------------------------------------------------
    Johnny Lee's fantastic speed improvement patch. 5291 is much better.

    > I took a look at this problem and have come up with a rough solution.
    >
    > The changes need some cleanup though.
    >
    > My changes reduce the import time (measured from when I click Open on
    > the Open File dialog box until 'Importing Document...' disappears from
    > the status bar) from ~120 seconds to ~22 seconds.
    >
    > The document spends at keast another 40 seconds building the document
    > though.
    >
    > Main changes:
    > - better checksum (aka hashcode) to help differentiate between AP
    > objects. - sort the AP objects so we can use binary search
    >
    > Description and patches can be found at
    > <http://www.geocities.com/typopl/bug5291.html>.
    >
    > J
    >
    >
    >>From: Martin Sevior <msevior@seviorpc.ph.unimelb.edu.au>
    >>Reply-To: msevior@physics.unimelb.edu.au
    >>To: msevior@physics.unimelb.edu.au
    >>CC: Dom Lachowicz <domlachowicz@yahoo.com>,AbiWord
    >><abiword-dev@abisource.com>
    >>Subject: Re: 5291 - A Task Force
    >>Date: Fri, 26 Sep 2003 14:26:54 +1000
    >>
    >>On Fri, 2003-09-26 at 12:14, Martin Sevior wrote:
    >> > On Fri, 2003-09-26 at 02:13, Dom Lachowicz wrote:
    >> > > > 1. Just #ifdef out the comparison in addIfUnique so
    >> > > > we just add the
    >> > > > strings. This provides no penalty for speed at all.
    >> > > > The downside is that
    >> > > > the attributes/properties grow linearly with
    >> > > > document size and every
    >> > > > time some presses a key.
    >> > >
    >> > > I agree with Tomas. This is a bad idea.
    >> > >
    >> > > > 2. Be selective about addIfUnique. We don't do the
    >> > > > ifUnique upon
    >> > > > document loading but keep it during editing. This
    >> > > > way the document
    >> > > > doesn't grow any bigger than needed during editing
    >> > > > but we keep the fast
    >> > > > load.
    >> > >
    >> > > This sounds better than the above, but is probably a
    >> > > bad idea for similar reasons that Tomas gave to the
    >> > > above.
    >> >
    >> > I have played with this a bit. It has the downside of doing more
    >> lookups during editting but the loading is much faster and is really
    >> easy to implement. I'm about to commit a patch that implements this
    >> as the default until we come up with something better. Feel free to
    >> flame we with quantitative reasons (ie some memory usage statistics
    >> before and after) why we should not go this route.
    >> >
    >> > (I'm not doing this because I think it's best, only because it's
    >> better than what we have now.)
    >>
    >>
    >>Dom told me I should post statistics first. Then commit if OK.
    >>
    >>I loaded the RTF-1.7 specificiation (in RTF form). I had CVS abiword
    >> compiled with full optimizations.
    >>
    >>Without the speed up patch, the memory consumption was 72 Megabytes.
    >> With this simple speed up patch, the memory consumption was 102
    >>Megabytes.
    >>
    >>This 30% increase in memory usage is significant. I think Dom and I
    >> decided to go down the route described in 4 using crc32 checksums to
    >> give us unique keys to the strings.
    >>
    >>Cheers
    >>
    >>Martin
    >> > >
    >> > > > 3. Have a cache of most recently used
    >> > > > Attribute/Properties. We do a
    >> > > > comparison with the properties in the cache and add
    >> > > > them if the
    >> > > > properties are not in the cache. This gives us a
    >> > > > constant penalty with
    >> > > > some trade-off of document size versus speed.
    >> > >
    >> > > I think that this is going to hurt us
    >> > > performance-wise. I don't see how one can make the
    >> > > assumption that one is likely to hit duplicate styles
    >> > > within short spans of time. Even still, we'd still
    >> > > need a method to compare the attributes, and a
    >> > > mechanism for passivating parts of the cache. I
    >> > > haven't done any thorough testing on this, but it
    >> > > seems to have all of the above flaws plus some new
    >> > > ones of its own.
    >> > >
    >> >
    >> > I think a simple buffer with a couple of heuristics could do rather
    >> well. We could play with the cache size to see how much effect that
    >> has.
    >> >
    >> > The heuristics would be: Every time an Att/Prop that does not find a
    >> match in the cache it replaces a Att/Prop in the cache and gets a
    >> new indexAP.
    >> >
    >> > Att/Props in the cache which get matched the score count incremented
    >> by one.
    >> >
    >> > The oldest Att/Prop with the lowest hit count gets discarded by the
    >> new Att/Prop.
    >> >
    >> > This would be simple and fast with a constant speed penalty (ie
    >> independent on n).
    >> >
    >> > > > 4. Mike Nordell's suggestion of string => integer
    >> > > > translation. I was
    >> > > > thinking of the same thing. Doing a MD5 sum might be
    >> > > > overkill but
    >> > > > something like that would guarantee a unique
    >> > > > integer/string key. The
    >> > > > integer keys could be stored in a binary vector
    >> > > > which could be searched
    >> > > > at order log(n) time. If we're clever with the data
    >> > > > structures we might
    >> > > > be able to do insertions into the array at order
    >> > > > log(n) time too.
    >> > >
    >> > > MD5Sums and integer IDs break down quickly in certain
    >> > > cases. They're acceptable if we define the granularity
    >> > > on an appropriate level, and make some other
    >> > > hard-and-set decisions. Currently, all these things
    >> > > look alike to the PT, and are stored in the same
    >> > > PP_Attribute:
    >> > >
    >> > > "font:Arial;size:12pt"
    >> > > "size:12pt;font:Arial"
    >> > > "font:Arial ;size:12pt"
    >> > >
    >> >
    >> > Currently we treat all these as unique AP's. My suggestion for this
    >> technique would continue to treat these are different even though as
    >> you say, they're sematically the same. My GUESS is that it is not
    >> worth the penalty to pull out the semantic information and order the
    >> props accordingly. However I could be wrong and would be happy to
    >> see some test data to show otherwise.
    >> >
    >> > > Do we want all of these treated the same?
    >> > > Semantically, they're all identical. As far as strcmp,
    >> > > MD5, etc... goes, they're all radically different
    >> > > strings.
    >> > >
    >> >
    >> > Yes.
    >> >
    >> > > If we want to treat these things as different, then
    >> > > using integer IDs is easy. There's no need for MD5sums
    >> > > - 'i++;' will do quite nicely.
    >> >
    >> > I don't get this. My proposal would be that every new new string
    >> gets a one computation to turn it into a unique integer.
    >> >
    >> > eg strings from the importers appear in this order and get
    >> translated into a unique integer
    >> > "font:Arial;size:12pt" => 2134563234
    >> > "size:12pt;font:Arial" => 3124563234
    >> > "font:Arial ;size:12pt" => 9234567234
    >> > "find:Times New Roman; size:14pt" => 89345673234
    >> > "font:Arial;size:12pt" => 2134563234
    >> >
    >> > Then instead of comparing strings we simply compare the
    >> keyed-integer's. If the integers match we have identical strings.
    >> This comparison is faster than string matching and has the
    >> additional advantage of allowing the strings to be ordered allowing
    >> a binary comparison or placed in a hash table.
    >> >
    >> > > However, if we treat
    >> > > these as different, the memory usage will likely go
    >> > > up. By how much (or little) remains to be seen.
    >> > >
    >> > > Finally, these suggestions, while productive, might
    >> > > not be the right place to solve this problem. We have
    >> > > quadratic behavior in our comparison routine. We must
    >> > > perform 47,000,000 comparisons from something that's
    >> > > only called 27,000 times. Yes, strcmp is a little
    >> > > wasteful. Is the solution to replace strcmp with an
    >> > > integer comparison, or to find a solution that doesn't
    >> > > require calling strcmp (or an equivalent integer
    >> > > comparison) 47 million times?
    >> > >
    >> >
    >> > The above suggestions would all do this. (Well suggestion 2 would
    >> only solve the comparison problem during import.)
    >> >
    >> > > Discuss amongst yourselves.
    >> > >
    >> >
    >> > We are :-)
    >> >
    >> > Cheers
    >> >
    >> > Martin
    >> >
    >> >
    >>
    >
    > _________________________________________________________________
    > Instant message in style with MSN Messenger 6.0. Download it now FREE!
    > http://msnmessenger-download.com



    This archive was generated by hypermail 2.1.4 : Sat Sep 27 2003 - 02:36:24 EDT