From: Johnny Lee (typo_pl@hotmail.com)
Date: Fri Sep 26 2003 - 16:32:06 EDT
Comments inline below.
>From: Dom Lachowicz <domlachowicz@yahoo.com>
>To: AbiWord <abiword-dev@abisource.com>
>Subject: Re: Light at the end of the tunnel Was Re: 5291 - A Task Force
>Date: Fri, 26 Sep 2003 12:52:11 -0700 (PDT)
>
>This is almost exactly what Martin and I were planning
>to do (time permitting).
>
>I do have some questions that you may or may not be
>able to answer:
>
>1) Before vs. After memory usage
Besides the increase due to the actual code changes:
- There's another vector in the pp_TableAttrProp class. This sorted vector
is the same size as the main vector in pp_TableAttrProp: 4n + 20 bytes,
where n is the number of elements in the vector.
- The pp_AttrProp class grows by 4 bytes (the m_index field). This should
probably be moved into the sorted vector table in pp_TableAttrProp; I
wanted to get the changes out before the weekend.
So my estimate is that ~8n + 20 more bytes are used. The RTF doc had
< 10,000 APs, I think, so I'd guess < 100K extra for the RTF doc.
When I checked the memory usage of the old and new AbiWord exes in the
Windows Task Manager, the old one was higher by a couple of MB, probably
because it uses the memory heap more.
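The arithmetic above can be sketched as a tiny model (the 4-byte sizes are assumptions for a 32-bit build of that era, and extraBytes is a hypothetical helper, not AbiWord code):

```cpp
#include <cassert>
#include <cstddef>

// Rough model of the extra memory, assuming a 32-bit build:
// - one extra sorted vector of 4-byte indices, mirroring the main
//   vector in pp_TableAttrProp (4n bytes plus ~20 bytes of overhead)
// - one extra 4-byte m_index field per pp_AttrProp
static std::size_t extraBytes(std::size_t n)
{
    const std::size_t sortedVector = 4 * n + 20; // indices + bookkeeping
    const std::size_t perApField   = 4 * n;      // m_index in each AP
    return sortedVector + perApField;            // ~8n + 20 total
}
```

With n = 10,000 APs this gives roughly 80 KB, consistent with the "< 100K" guess above.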
>2) What're the mathematical chances for the Glib
>string hashing algorithm to produce a duplicate hash
>given different data? I'm guessing that it's much
>higher than a CRC32 or MD5/SHA1 sum, given that
>hashtables often dynamically resize or chain buckets
>together when they encounter a collision. Note that
>fixing this just means using a better checksum
>generator.
I don't know how often you'd get duplicate hash values. The code doesn't
even hash every bit of data; it only looks at up to 8 bytes of each
attribute, property, and value string.
When it finds an AP with a matching hash value, it uses that index as the
starting point for a linear search, stopping when an exact match is found
or the AP being checked has a different hash value.
As long as there aren't too many duplicates, the code shouldn't degenerate
into the old behavior.
I didn't measure how many hash duplicates there were in the RTF doc, but
if you look at the number of calls to PP_AttrProp::isExactMatch from
pp_TableAttrProp::findMatch in the profile results for the new AbiWord
(52284 vs. 27492), there aren't many duplicates for the APs AbiWord was
looking for, i.e. fewer than 2 on average.
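A minimal sketch of that lookup scheme (checksum8, AP, and this findMatch are illustrative stand-ins, not the actual AbiWord API; the real code hashes the attribute, property, and value strings of each AP):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Checksum built from at most the first 8 bytes of a string, as
// described above. The exact mixing function is an assumption.
static std::uint32_t checksum8(const std::string& s)
{
    std::uint32_t h = 0;
    for (std::size_t i = 0; i < s.size() && i < 8; ++i)
        h = h * 31 + static_cast<unsigned char>(s[i]);
    return h;
}

struct AP {
    std::uint32_t hash;
    std::string   props; // stand-in for the attrib/property/value data
};

// APs are kept sorted by checksum: binary-search to the first AP with
// a matching checksum, then scan linearly, exact-comparing each AP,
// stopping as soon as the checksum no longer matches.
static int findMatch(const std::vector<AP>& sorted, const std::string& props)
{
    const std::uint32_t h = checksum8(props);
    auto it = std::lower_bound(sorted.begin(), sorted.end(), h,
        [](const AP& ap, std::uint32_t key) { return ap.hash < key; });
    for (; it != sorted.end() && it->hash == h; ++it)
        if (it->props == props)
            return static_cast<int>(it - sorted.begin());
    return -1; // no exact match among the APs sharing this checksum
}
```

With few checksum duplicates, the linear portion of the scan stays short, which is what the isExactMatch-vs-findMatch call ratio above suggests.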
J
>
>Other than that, this looks like a superb idea. It
>gets rid of the O(n^2) behavior AND removes the bulk
>of the strcmps. Well done.
>
>Dom
>
>--- Johnny Lee <typo_pl@hotmail.com> wrote:
> > I took a look at this problem and have come up with
> > a rough solution.
> >
> > The changes need some cleanup though.
> >
> > My changes reduce the import time (measured from
> > when I click Open on the
> > Open File dialog box until 'Importing Document...'
> > disappears from the
> > status bar) from ~120 seconds to ~22 seconds.
> >
> > The document spends at least another 40 seconds
> > building the document, though.
> >
> > Main changes:
> > - better checksum (aka hashcode) to help
> > differentiate between AP objects.
> > - sort the AP objects so we can use binary search
> >
> > Description and patches can be found at
> > <http://www.geocities.com/typopl/bug5291.html>.
> >
> > J
> >
> >
> > >From: Martin Sevior
> > <msevior@seviorpc.ph.unimelb.edu.au>
> > >Reply-To: msevior@physics.unimelb.edu.au
> > >To: msevior@physics.unimelb.edu.au
> > >CC: Dom Lachowicz <domlachowicz@yahoo.com>,AbiWord
> > ><abiword-dev@abisource.com>
> > >Subject: Re: 5291 - A Task Force
> > >Date: Fri, 26 Sep 2003 14:26:54 +1000
> > >
> > >On Fri, 2003-09-26 at 12:14, Martin Sevior wrote:
> > > > On Fri, 2003-09-26 at 02:13, Dom Lachowicz
> > wrote:
> > > > > > 1. Just #ifdef out the comparison in
> > addIfUnique so
> > > > > > we just add the
> > > > > > strings. This provides no penalty for speed
> > at all.
> > > > > > The downside is that
> > > > > > the attributes/properties grow linearly with
> > > > > > document size and every
> > > > > > time someone presses a key.
> > > > >
> > > > > I agree with Tomas. This is a bad idea.
> > > > >
> > > > > > 2. Be selective about addIfUnique. We don't
> > do the
> > > > > > ifUnique upon
> > > > > > document loading but keep it during editing.
> > This
> > > > > > way the document
> > > > > > doesn't grow any bigger than needed during
> > editing
> > > > > > but we keep the fast
> > > > > > load.
> > > > >
> > > > > This sounds better than the above, but is
> > probably a
> > > > > bad idea for similar reasons that Tomas gave
> > to the
> > > > > above.
> > > >
> > > > I have played with this a bit. It has the
> > downside of doing more lookups
> > > > during editing but the loading is much faster
> > and is really easy to
> > > > implement. I'm about to commit a patch that
> > implements this as the
> > > > default until we come up with something better.
> > Feel free to flame me
> > > > with quantitative reasons (ie some memory usage
> > statistics before and
> > > > after) why we should not go this route.
> > > >
> > > > (I'm not doing this because I think it's best,
> > only because it's better
> > > > than what we have now.)
> > >
> > >
> > >Dom told me I should post statistics first. Then
> > commit if OK.
> > >
> > >I loaded the RTF-1.7 specification (in RTF form).
> > I had CVS abiword
> > >compiled with full optimizations.
> > >
> > >Without the speed up patch, the memory consumption
> > was 72 Megabytes.
> > >With this simple speed up patch, the memory
> > consumption was 102
> > >Megabytes.
> > >
> > >This 30 Megabyte (roughly 40%) increase in memory
> > >usage is significant. I think Dom and I decided to
> > >go down the route described in option 4, using crc32
> > >checksums to give us unique keys to the strings.
> > >
> > >Cheers
> > >
> > >Martin
> > > > >
> > > > > > 3. Have a cache of most recently used
> > > > > > Attribute/Properties. We do a
> > > > > > comparison with the properties in the cache
> > and add
> > > > > > them if the
> > > > > > properties are not in the cache. This gives
> > us a
> > > > > > constant penalty with
> > > > > > some trade-off of document size versus
> > speed.
> > > > >
> > > > > I think that this is going to hurt us
> > > > > performance-wise. I don't see how one can make
> > the
> > > > > assumption that one is likely to hit duplicate
> > styles
> > > > > within short spans of time. Even still, we'd
> > still
> > > > > need a method to compare the attributes, and a
> > > > > mechanism for passivating parts of the cache.
> > I
> > > > > haven't done any thorough testing on this, but
> > it
> > > > > seems to have all of the above flaws plus some
> > new
> > > > > ones of its own.
> > > > >
> > > >
> > > > I think a simple buffer with a couple of
> > heuristics could do rather
> > > > well. We could play with the cache size to see
> > how much effect that has.
> > > >
> > > > The heuristics would be: every time an Att/Prop
> > > > does not find a match in the cache, it replaces an
> > > > Att/Prop in the cache and gets a new indexAP.
> > > >
> > > > Att/Props in the cache that get matched have their
> > > > hit count incremented by one.
> > > >
> > > > The oldest Att/Prop with the lowest hit count gets
> > > > discarded in favor of the new Att/Prop.
> > > >
> > > > This would be simple and fast, with a constant
> > > > speed penalty (i.e. independent of n).
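The cache heuristic described above could look roughly like this (APCache, its capacity, and lookupOrInsert are all hypothetical names, a sketch rather than a proposed patch):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Fixed-size cache of recently seen Att/Props. Every lookup scans the
// whole cache, so the cost per operation is constant in document size.
// A hit bumps that entry's hit count; a miss evicts the entry with the
// lowest hit count (scanning front-to-back means the oldest such entry
// is the one replaced, matching the heuristic above).
class APCache {
public:
    explicit APCache(std::size_t capacity) : m_capacity(capacity) {}

    // Returns true on a cache hit; on a miss, inserts the new Att/Prop,
    // evicting if the cache is full.
    bool lookupOrInsert(const std::string& props)
    {
        for (Entry& e : m_entries)
            if (e.props == props) { ++e.hits; return true; }
        if (m_entries.size() < m_capacity) {
            m_entries.push_back({props, 0});
        } else {
            std::size_t victim = 0;
            for (std::size_t i = 1; i < m_entries.size(); ++i)
                if (m_entries[i].hits < m_entries[victim].hits)
                    victim = i;
            m_entries[victim] = {props, 0};
        }
        return false;
    }

private:
    struct Entry { std::string props; unsigned hits; };
    std::size_t        m_capacity;
    std::vector<Entry> m_entries;
};
```

Playing with the capacity, as suggested above, is then a one-line change at the construction site.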
> > > >
> > > > > > 4. Mike Nordell's suggestion of string =>
> > integer
> > > > > > translation. I was
> > > > > > thinking of the same thing. Doing a MD5 sum
> > might be
> > > > > > overkill but
> > > > > > something like that would guarantee a unique
> > > > > > integer/string key. The
> > > > > > integer keys could be stored in a binary
> > vector
> > > > > > which could be searched
> > > > > > at order log(n) time. If we're clever with
> > the data
> > > > > > structures we might
> > > > > > be able to do insertions into the array at
> > order
> > > > > > log(n) time too.
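A sketch of that option, under the assumption that the integer keys are checksums of the strings (KeyIndex and insertUnique are illustrative names). One honest caveat: std::lower_bound gives the O(log n) search, but inserting into a plain vector still shifts elements, so truly logarithmic insertion would need the cleverer data structure hinted at above:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sorted vector of integer keys, binary-searched on lookup.
class KeyIndex {
public:
    // Returns true if key was already present, false if newly inserted.
    bool insertUnique(std::uint32_t key)
    {
        auto it = std::lower_bound(m_keys.begin(), m_keys.end(), key);
        if (it != m_keys.end() && *it == key)
            return true;        // already have this AP's key
        m_keys.insert(it, key); // keep the vector sorted (O(n) shift)
        return false;
    }

    bool contains(std::uint32_t key) const
    {
        return std::binary_search(m_keys.begin(), m_keys.end(), key);
    }

private:
    std::vector<std::uint32_t> m_keys;
};
```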
> > > > >
> > > > > MD5Sums and integer IDs break down quickly in
> > certain
> > > > > cases. They're acceptable if we define the
> > granularity
> > > > > on an appropriate level, and make some other
> > > > > hard-and-fast decisions. Currently, all these
> > things
> > > > > look alike to the PT, and are stored in the
> > same
> > > > > PP_Attribute:
> > > > >
> > > > > "font:Arial;size:12pt"
> > > > > "size:12pt;font:Arial"
> > > > > "font:Arial ;size:12pt"
> > > > >
> > > >
> > > > Currently we treat all of these as unique APs. My
> > > > suggestion for this technique would continue to
> > > > treat them as different even though, as you say,
> > > > they're semantically the same. My GUESS is that it
> > > > is not worth the penalty to pull out the semantic
> > > > information and order the props accordingly.
> > > > However, I could be wrong and would be
>=== message truncated ===
>
>
This archive was generated by hypermail 2.1.4 : Fri Sep 26 2003 - 16:47:19 EDT