Re: Light at the end of the tunnel Was Re: 5291 - A Task Force

From: msevior@physics.unimelb.edu.au
Date: Fri Sep 26 2003 - 19:32:45 EDT

Next message: msevior@physics.unimelb.edu.au: "commit: Fix printing of tables."

Previous message: Hubert Figuiere: "Re: Light at the end of the tunnel Was Re: 5291 - A Task Force"
In reply to: Johnny Lee: "Light at the end of the tunnel Was Re: 5291 - A Task Force"
Next in thread: msevior@physics.unimelb.edu.au: "commit: (HEAD) Re: Light at the end of the tunnel Was Re: 5291 - A Task Force"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]

> I took a look at this problem and have come up with a rough solution.
>
> The changes need some cleanup though.
>
> My changes reduce the import time (measured from when I click Open on
> the Open File dialog box until 'Importing Document...' disappears from
> the status bar) from ~120 seconds to ~22 seconds.
>
> The document spends at keast another 40 seconds building the document
> though.
>
> Main changes:
> - better checksum (aka hashcode) to help differentiate between AP
> objects. - sort the AP objects so we can use binary search
>
> Description and patches can be found at
> <http://www.geocities.com/typopl/bug5291.html>.
>
> J
>

HI Johnny,
Thanks VERY much for this. This is very similar to what Dom and
were I thinking of doing. The only difference is that I
attempted to use the UT_Map.cpp class which is meant to use a
Red-Black tree for storing the CRC checksums. The advantage of
the UT_Map solution is that INSERTIONS are order log(n) as well
as searches.

The vector approach has insertions at order n.

Never-the-less my code fell flat on it's face, whereas yours (apprently)
works!

This will speed up AbiWord performance for large docs a lot and once
verified I fully support including this patch.

Cheers

Martin

PS. There is at least one bug the UT_CRC32 calculation.
>
>>From: Martin Sevior <msevior@seviorpc.ph.unimelb.edu.au>
>>Reply-To: msevior@physics.unimelb.edu.au
>>To: msevior@physics.unimelb.edu.au
>>CC: Dom Lachowicz <domlachowicz@yahoo.com>,AbiWord
>><abiword-dev@abisource.com>
>>Subject: Re: 5291 - A Task Force
>>Date: Fri, 26 Sep 2003 14:26:54 +1000
>>
>>On Fri, 2003-09-26 at 12:14, Martin Sevior wrote:
>> > On Fri, 2003-09-26 at 02:13, Dom Lachowicz wrote:
>> > > > 1. Just #ifdef out the comparison in addIfUnique so
>> > > > we just add the
>> > > > strings. This provides no penalty for speed at all.
>> > > > The downside is that
>> > > > the attributes/properties grow linearly with
>> > > > document size and every
>> > > > time some presses a key.
>> > >
>> > > I agree with Tomas. This is a bad idea.
>> > >
>> > > > 2. Be selective about addIfUnique. We don't do the
>> > > > ifUnique upon
>> > > > document loading but keep it during editing. This
>> > > > way the document
>> > > > doesn't grow any bigger than needed during editing
>> > > > but we keep the fast
>> > > > load.
>> > >
>> > > This sounds better than the above, but is probably a
>> > > bad idea for similar reasons that Tomas gave to the
>> > > above.
>> >
>> > I have played with this a bit. It has the downside of doing more
>> lookups during editting but the loading is much faster and is really
>> easy to implement. I'm about to commit a patch that implements this
>> as the default until we come up with something better. Feel free to
>> flame we with quantitative reasons (ie some memory usage statistics
>> before and after) why we should not go this route.
>> >
>> > (I'm not doing this because I think it's best, only because it's
>> better than what we have now.)
>>
>>
>>Dom told me I should post statistics first. Then commit if OK.
>>
>>I loaded the RTF-1.7 specificiation (in RTF form). I had CVS abiword
>> compiled with full optimizations.
>>
>>Without the speed up patch, the memory consumption was 72 Megabytes.
>> With this simple speed up patch, the memory consumption was 102
>>Megabytes.
>>
>>This 30% increase in memory usage is significant. I think Dom and I
>> decided to go down the route described in 4 using crc32 checksums to
>> give us unique keys to the strings.
>>
>>Cheers
>>
>>Martin
>> > >
>> > > > 3. Have a cache of most recently used
>> > > > Attribute/Properties. We do a
>> > > > comparison with the properties in the cache and add
>> > > > them if the
>> > > > properties are not in the cache. This gives us a
>> > > > constant penalty with
>> > > > some trade-off of document size versus speed.
>> > >
>> > > I think that this is going to hurt us
>> > > performance-wise. I don't see how one can make the
>> > > assumption that one is likely to hit duplicate styles
>> > > within short spans of time. Even still, we'd still
>> > > need a method to compare the attributes, and a
>> > > mechanism for passivating parts of the cache. I
>> > > haven't done any thorough testing on this, but it
>> > > seems to have all of the above flaws plus some new
>> > > ones of its own.
>> > >
>> >
>> > I think a simple buffer with a couple of heuristics could do rather
>> well. We could play with the cache size to see how much effect that
>> has.
>> >
>> > The heuristics would be: Every time an Att/Prop that does not find a
>> match in the cache it replaces a Att/Prop in the cache and gets a
>> new indexAP.
>> >
>> > Att/Props in the cache which get matched the score count incremented
>> by one.
>> >
>> > The oldest Att/Prop with the lowest hit count gets discarded by the
>> new Att/Prop.
>> >
>> > This would be simple and fast with a constant speed penalty (ie
>> independent on n).
>> >
>> > > > 4. Mike Nordell's suggestion of string => integer
>> > > > translation. I was
>> > > > thinking of the same thing. Doing a MD5 sum might be
>> > > > overkill but
>> > > > something like that would guarantee a unique
>> > > > integer/string key. The
>> > > > integer keys could be stored in a binary vector
>> > > > which could be searched
>> > > > at order log(n) time. If we're clever with the data
>> > > > structures we might
>> > > > be able to do insertions into the array at order
>> > > > log(n) time too.
>> > >
>> > > MD5Sums and integer IDs break down quickly in certain
>> > > cases. They're acceptable if we define the granularity
>> > > on an appropriate level, and make some other
>> > > hard-and-set decisions. Currently, all these things
>> > > look alike to the PT, and are stored in the same
>> > > PP_Attribute:
>> > >
>> > > "font:Arial;size:12pt"
>> > > "size:12pt;font:Arial"
>> > > "font:Arial ;size:12pt"
>> > >
>> >
>> > Currently we treat all these as unique AP's. My suggestion for this
>> technique would continue to treat these are different even though as
>> you say, they're sematically the same. My GUESS is that it is not
>> worth the penalty to pull out the semantic information and order the
>> props accordingly. However I could be wrong and would be happy to
>> see some test data to show otherwise.
>> >
>> > > Do we want all of these treated the same?
>> > > Semantically, they're all identical. As far as strcmp,
>> > > MD5, etc... goes, they're all radically different
>> > > strings.
>> > >
>> >
>> > Yes.
>> >
>> > > If we want to treat these things as different, then
>> > > using integer IDs is easy. There's no need for MD5sums
>> > > - 'i++;' will do quite nicely.
>> >
>> > I don't get this. My proposal would be that every new new string
>> gets a one computation to turn it into a unique integer.
>> >
>> > eg strings from the importers appear in this order and get
>> translated into a unique integer
>> > "font:Arial;size:12pt" => 2134563234
>> > "size:12pt;font:Arial" => 3124563234
>> > "font:Arial ;size:12pt" => 9234567234
>> > "find:Times New Roman; size:14pt" => 89345673234
>> > "font:Arial;size:12pt" => 2134563234
>> >
>> > Then instead of comparing strings we simply compare the
>> keyed-integer's. If the integers match we have identical strings.
>> This comparison is faster than string matching and has the
>> additional advantage of allowing the strings to be ordered allowing
>> a binary comparison or placed in a hash table.
>> >
>> > > However, if we treat
>> > > these as different, the memory usage will likely go
>> > > up. By how much (or little) remains to be seen.
>> > >
>> > > Finally, these suggestions, while productive, might
>> > > not be the right place to solve this problem. We have
>> > > quadratic behavior in our comparison routine. We must
>> > > perform 47,000,000 comparisons from something that's
>> > > only called 27,000 times. Yes, strcmp is a little
>> > > wasteful. Is the solution to replace strcmp with an
>> > > integer comparison, or to find a solution that doesn't
>> > > require calling strcmp (or an equivalent integer
>> > > comparison) 47 million times?
>> > >
>> >
>> > The above suggestions would all do this. (Well suggestion 2 would
>> only solve the comparison problem during import.)
>> >
>> > > Discuss amongst yourselves.
>> > >
>> >
>> > We are :-)
>> >
>> > Cheers
>> >
>> > Martin
>> >
>> >
>>
>
> _________________________________________________________________
> Instant message in style with MSN Messenger 6.0. Download it now FREE!
> http://msnmessenger-download.com

Next message: msevior@physics.unimelb.edu.au: "commit: Fix printing of tables."
Previous message: Hubert Figuiere: "Re: Light at the end of the tunnel Was Re: 5291 - A Task Force"
In reply to: Johnny Lee: "Light at the end of the tunnel Was Re: 5291 - A Task Force"
Next in thread: msevior@physics.unimelb.edu.au: "commit: (HEAD) Re: Light at the end of the tunnel Was Re: 5291 - A Task Force"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]

This archive was generated by hypermail 2.1.4 : Fri Sep 26 2003 - 19:52:17 EDT