From: Johnny Lee (typo_pl@hotmail.com)
Date: Fri Sep 26 2003 - 15:39:42 EDT
I took a look at this problem and have come up with a rough solution.
The changes need some cleanup though.
My changes reduce the import time (measured from when I click Open on the
Open File dialog box until 'Importing Document...' disappears from the
status bar) from ~120 seconds to ~22 seconds.
The document spends at keast another 40 seconds building the document
though.
Main changes:
- better checksum (aka hashcode) to help differentiate between AP objects.
- sort the AP objects so we can use binary search
Description and patches can be found at
<http://www.geocities.com/typopl/bug5291.html>.
J
>From: Martin Sevior <msevior@seviorpc.ph.unimelb.edu.au>
>Reply-To: msevior@physics.unimelb.edu.au
>To: msevior@physics.unimelb.edu.au
>CC: Dom Lachowicz <domlachowicz@yahoo.com>,AbiWord
><abiword-dev@abisource.com>
>Subject: Re: 5291 - A Task Force
>Date: Fri, 26 Sep 2003 14:26:54 +1000
>
>On Fri, 2003-09-26 at 12:14, Martin Sevior wrote:
> > On Fri, 2003-09-26 at 02:13, Dom Lachowicz wrote:
> > > > 1. Just #ifdef out the comparison in addIfUnique so
> > > > we just add the
> > > > strings. This provides no penalty for speed at all.
> > > > The downside is that
> > > > the attributes/properties grow linearly with
> > > > document size and every
> > > > time some presses a key.
> > >
> > > I agree with Tomas. This is a bad idea.
> > >
> > > > 2. Be selective about addIfUnique. We don't do the
> > > > ifUnique upon
> > > > document loading but keep it during editing. This
> > > > way the document
> > > > doesn't grow any bigger than needed during editing
> > > > but we keep the fast
> > > > load.
> > >
> > > This sounds better than the above, but is probably a
> > > bad idea for similar reasons that Tomas gave to the
> > > above.
> >
> > I have played with this a bit. It has the downside of doing more lookups
> > during editting but the loading is much faster and is really easy to
> > implement. I'm about to commit a patch that implements this as the
> > default until we come up with something better. Feel free to flame we
> > with quantitative reasons (ie some memory usage statistics before and
> > after) why we should not go this route.
> >
> > (I'm not doing this because I think it's best, only because it's better
> > than what we have now.)
>
>
>Dom told me I should post statistics first. Then commit if OK.
>
>I loaded the RTF-1.7 specificiation (in RTF form). I had CVS abiword
>compiled with full optimizations.
>
>Without the speed up patch, the memory consumption was 72 Megabytes.
>With this simple speed up patch, the memory consumption was 102
>Megabytes.
>
>This 30% increase in memory usage is significant. I think Dom and I
>decided to go down the route described in 4 using crc32 checksums to
>give us unique keys to the strings.
>
>Cheers
>
>Martin
> > >
> > > > 3. Have a cache of most recently used
> > > > Attribute/Properties. We do a
> > > > comparison with the properties in the cache and add
> > > > them if the
> > > > properties are not in the cache. This gives us a
> > > > constant penalty with
> > > > some trade-off of document size versus speed.
> > >
> > > I think that this is going to hurt us
> > > performance-wise. I don't see how one can make the
> > > assumption that one is likely to hit duplicate styles
> > > within short spans of time. Even still, we'd still
> > > need a method to compare the attributes, and a
> > > mechanism for passivating parts of the cache. I
> > > haven't done any thorough testing on this, but it
> > > seems to have all of the above flaws plus some new
> > > ones of its own.
> > >
> >
> > I think a simple buffer with a couple of heuristics could do rather
> > well. We could play with the cache size to see how much effect that has.
> >
> > The heuristics would be: Every time an Att/Prop that does not find a
> > match in the cache it replaces a Att/Prop in the cache and gets a new
> > indexAP.
> >
> > Att/Props in the cache which get matched the score count incremented by
> > one.
> >
> > The oldest Att/Prop with the lowest hit count gets discarded by the new
> > Att/Prop.
> >
> > This would be simple and fast with a constant speed penalty (ie
> > independent on n).
> >
> > > > 4. Mike Nordell's suggestion of string => integer
> > > > translation. I was
> > > > thinking of the same thing. Doing a MD5 sum might be
> > > > overkill but
> > > > something like that would guarantee a unique
> > > > integer/string key. The
> > > > integer keys could be stored in a binary vector
> > > > which could be searched
> > > > at order log(n) time. If we're clever with the data
> > > > structures we might
> > > > be able to do insertions into the array at order
> > > > log(n) time too.
> > >
> > > MD5Sums and integer IDs break down quickly in certain
> > > cases. They're acceptable if we define the granularity
> > > on an appropriate level, and make some other
> > > hard-and-set decisions. Currently, all these things
> > > look alike to the PT, and are stored in the same
> > > PP_Attribute:
> > >
> > > "font:Arial;size:12pt"
> > > "size:12pt;font:Arial"
> > > "font:Arial ;size:12pt"
> > >
> >
> > Currently we treat all these as unique AP's. My suggestion for this
> > technique would continue to treat these are different even though as you
> > say, they're sematically the same. My GUESS is that it is not worth the
> > penalty to pull out the semantic information and order the props
> > accordingly. However I could be wrong and would be happy to see some
> > test data to show otherwise.
> >
> > > Do we want all of these treated the same?
> > > Semantically, they're all identical. As far as strcmp,
> > > MD5, etc... goes, they're all radically different
> > > strings.
> > >
> >
> > Yes.
> >
> > > If we want to treat these things as different, then
> > > using integer IDs is easy. There's no need for MD5sums
> > > - 'i++;' will do quite nicely.
> >
> > I don't get this. My proposal would be that every new new string gets a
> > one computation to turn it into a unique integer.
> >
> > eg strings from the importers appear in this order and get translated
> > into a unique integer
> > "font:Arial;size:12pt" => 2134563234
> > "size:12pt;font:Arial" => 3124563234
> > "font:Arial ;size:12pt" => 9234567234
> > "find:Times New Roman; size:14pt" => 89345673234
> > "font:Arial;size:12pt" => 2134563234
> >
> > Then instead of comparing strings we simply compare the keyed-integer's.
> > If the integers match we have identical strings. This comparison is
> > faster than string matching and has the additional advantage of allowing
> > the strings to be ordered allowing a binary comparison or placed in a
> > hash table.
> >
> > > However, if we treat
> > > these as different, the memory usage will likely go
> > > up. By how much (or little) remains to be seen.
> > >
> > > Finally, these suggestions, while productive, might
> > > not be the right place to solve this problem. We have
> > > quadratic behavior in our comparison routine. We must
> > > perform 47,000,000 comparisons from something that's
> > > only called 27,000 times. Yes, strcmp is a little
> > > wasteful. Is the solution to replace strcmp with an
> > > integer comparison, or to find a solution that doesn't
> > > require calling strcmp (or an equivalent integer
> > > comparison) 47 million times?
> > >
> >
> > The above suggestions would all do this. (Well suggestion 2 would only
> > solve the comparison problem during import.)
> >
> > > Discuss amongst yourselves.
> > >
> >
> > We are :-)
> >
> > Cheers
> >
> > Martin
> >
> >
>
_________________________________________________________________
Instant message in style with MSN Messenger 6.0. Download it now FREE!
http://msnmessenger-download.com
This archive was generated by hypermail 2.1.4 : Fri Sep 26 2003 - 15:54:42 EDT