From: Dom Lachowicz (domlachowicz@yahoo.com)
Date: Fri Sep 26 2003 - 15:52:11 EDT
This is almost exactly what Martin and I were planning
to do (time permitting).
I do have some questions that you may or may not be
able to answer:
1) Before vs. After memory usage
2) What are the mathematical chances of the GLib
string hashing algorithm producing a duplicate hash
for different data? I'm guessing they're much higher
than for a CRC32 or an MD5/SHA-1 sum, given that hash
tables often dynamically resize or chain buckets
together when they encounter a collision. Note that
fixing this just means switching to a better checksum
generator.
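For what it's worth, the answer to (2) is dominated by output width rather than algorithm quality: any 32-bit hash obeys the birthday bound. A sketch (current GLib's g_str_hash is the djb2 scheme shown here, though the exact 2003-era variant may differ; collision_probability is my illustrative name, not a GLib function):

```cpp
#include <cmath>
#include <cstdint>

// djb2-style string hash, the scheme current GLib's g_str_hash uses
// (h = h * 33 + c, seeded with 5381).  Collisions are possible for
// any 32-bit hash, however good the mixing is.
static std::uint32_t djb2_hash(const char* s)
{
    std::uint32_t h = 5381;
    for (; *s; ++s)
        h = h * 33u + static_cast<unsigned char>(*s);
    return h;
}

// Birthday-bound estimate: probability that n distinct strings
// produce at least one duplicate 32-bit hash value,
// p ~ 1 - exp(-n*(n-1) / 2^33).
static double collision_probability(double n)
{
    return 1.0 - std::exp(-n * (n - 1.0) / 8589934592.0); // 2^33
}
```

By that bound, around 100,000 distinct strings already give roughly a two-in-three chance of at least one 32-bit collision, which is why equal checksums still need a fallback string compare no matter which generator we pick.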
Other than that, this looks like a superb idea. It
gets rid of the O(n^2) behavior AND removes the bulk
of the strcmp calls. Well done.
Dom
--- Johnny Lee <typo_pl@hotmail.com> wrote:
> I took a look at this problem and have come up with
> a rough solution.
>
> The changes need some cleanup though.
>
> My changes reduce the import time (measured from
> when I click Open on the
> Open File dialog box until 'Importing Document...'
> disappears from the
> status bar) from ~120 seconds to ~22 seconds.
>
> Building the document then takes at least another
> 40 seconds, though.
>
> Main changes:
> - better checksum (aka hashcode) to help
> differentiate between AP objects.
> - sort the AP objects so we can use binary search
>
> Description and patches can be found at
> <http://www.geocities.com/typopl/bug5291.html>.
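The two main changes combine naturally; here is a sketch of how a checksum-sorted table makes the lookup an O(log n) search (APEntry, APTable, and find_or_insert are illustrative names, not the actual patch's symbols):

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch of the sorted-checksum AP lookup; names are
// illustrative, not AbiWord's actual symbols.
struct APEntry {
    std::uint32_t checksum;  // better checksum (e.g. CRC32) of the string
    std::string   props;     // the attribute/property string itself
};

// Kept sorted by checksum so lookups use binary search.
using APTable = std::vector<APEntry>;

// Returns the index of props in the table, inserting it if absent.
// On a checksum collision we fall back to comparing the strings.
std::size_t find_or_insert(APTable& table, std::uint32_t sum,
                           const std::string& props)
{
    auto it = std::lower_bound(table.begin(), table.end(), sum,
        [](const APEntry& e, std::uint32_t s) { return e.checksum < s; });
    // Walk the (normally tiny) run of equal checksums.
    while (it != table.end() && it->checksum == sum) {
        if (it->props == props)
            return static_cast<std::size_t>(it - table.begin());
        ++it;
    }
    it = table.insert(it, APEntry{sum, props});
    return static_cast<std::size_t>(it - table.begin());
}
```

Note that while the search is O(log n), inserting into a plain vector still shuffles elements; a balanced tree would make the insertion O(log n) too, as the thread below discusses.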
>
> J
>
>
> >From: Martin Sevior
> <msevior@seviorpc.ph.unimelb.edu.au>
> >Reply-To: msevior@physics.unimelb.edu.au
> >To: msevior@physics.unimelb.edu.au
> >CC: Dom Lachowicz <domlachowicz@yahoo.com>,AbiWord
> ><abiword-dev@abisource.com>
> >Subject: Re: 5291 - A Task Force
> >Date: Fri, 26 Sep 2003 14:26:54 +1000
> >
> >On Fri, 2003-09-26 at 12:14, Martin Sevior wrote:
> > > On Fri, 2003-09-26 at 02:13, Dom Lachowicz
> wrote:
> > > > > 1. Just #ifdef out the comparison in
> addIfUnique so
> > > > > we just add the
> > > > > strings. This provides no penalty for speed
> at all.
> > > > > The downside is that
> > > > > the attributes/properties grow linearly with
> > > > > document size and every
> > > > > time someone presses a key.
> > > >
> > > > I agree with Tomas. This is a bad idea.
> > > >
> > > > > 2. Be selective about addIfUnique. We don't
> do the
> > > > > ifUnique upon
> > > > > document loading but keep it during editing.
> This
> > > > > way the document
> > > > > doesn't grow any bigger than needed during
> editing
> > > > > but we keep the fast
> > > > > load.
> > > >
> > > > This sounds better than the above, but is
> probably a
> > > > bad idea for similar reasons that Tomas gave
> to the
> > > > above.
> > >
> > > I have played with this a bit. It has the
> downside of doing more lookups
> > > during editing, but the loading is much faster
> and is really easy to
> > > implement. I'm about to commit a patch that
> implements this as the
> > > default until we come up with something better.
> Feel free to flame me
> > > with quantitative reasons (ie some memory usage
> statistics before and
> > > after) why we should not go this route.
> > >
> > > (I'm not doing this because I think it's best,
> only because it's better
> > > than what we have now.)
> >
> >
> >Dom told me I should post statistics first. Then
> commit if OK.
> >
> >I loaded the RTF-1.7 specification (in RTF form).
> I had CVS abiword
> >compiled with full optimizations.
> >
> >Without the speed up patch, the memory consumption
> was 72 Megabytes.
> >With this simple speed up patch, the memory
> consumption was 102
> >Megabytes.
> >
> >This ~40% increase in memory usage is significant.
> I think Dom and I
> >decided to go down the route described in option 4,
> using crc32 checksums to
> >give us unique keys for the strings.
> >
> >Cheers
> >
> >Martin
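For reference, the crc32 Martin mentions is easy to drop in; a minimal bitwise version of the standard IEEE 802.3 CRC-32 looks like this (table-driven variants are faster but longer). Strictly speaking CRC32 only makes duplicate keys rare, not impossible, so equal checksums would still need a string compare:

```cpp
#include <cstddef>
#include <cstdint>

// Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
// Slow but dependency-free; a 256-entry table version is the usual
// production choice.
std::uint32_t crc32(const unsigned char* data, std::size_t len)
{
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int k = 0; k < 8; ++k)
            // Branchless: XOR the polynomial only when the low bit is set.
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}
```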
> > > >
> > > > > 3. Have a cache of most recently used
> > > > > Attribute/Properties. We do a
> > > > > comparison with the properties in the cache
> and add
> > > > > them if the
> > > > > properties are not in the cache. This gives
> us a
> > > > > constant penalty with
> > > > > some trade-off of document size versus
> speed.
> > > >
> > > > I think that this is going to hurt us
> > > > performance-wise. I don't see how one can make
> the
> > > > assumption that one is likely to hit duplicate
> styles
> > > > within short spans of time. Even so, we'd
> still
> > > > need a method to compare the attributes, and a
> > > > mechanism for passivating parts of the cache.
> I
> > > > haven't done any thorough testing on this, but
> it
> > > > seems to have all of the above flaws plus some
> new
> > > > ones of its own.
> > > >
> > >
> > > I think a simple buffer with a couple of
> heuristics could do rather
> > > well. We could play with the cache size to see
> how much effect that has.
> > >
> > > The heuristics would be: every time an Att/Prop
> does not find a
> > > match in the cache, it replaces an Att/Prop in
> the cache and gets a new
> > > indexAP.
> > >
> > > Att/Props in the cache which get matched have
> their hit count incremented by
> > > one.
> > >
> > > The oldest Att/Prop with the lowest hit count
> gets discarded by the new
> > > Att/Prop.
> > >
> > > This would be simple and fast with a constant
> > > speed penalty (i.e.
> > > independent of n).
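Concretely, the buffer-with-heuristics idea described above could be sketched like this (APCache and the exact eviction tie-break, lowest hit count then oldest entry, are my assumptions, not an actual patch):

```cpp
#include <string>
#include <vector>

// Sketch of a fixed-size Att/Prop cache; names are illustrative.
// Eviction: the entry with the lowest hit count goes, ties broken
// by age (lowest insertion stamp = oldest).
struct CacheEntry {
    std::string props;   // the attribute/property string
    unsigned    hits;    // incremented on every cache match
    unsigned    stamp;   // insertion order, used as the age tie-break
};

class APCache {
public:
    explicit APCache(std::size_t capacity) : m_capacity(capacity) {}

    // Returns true on a cache hit (and bumps the hit count);
    // on a miss, inserts props, evicting first if the cache is full.
    bool lookupOrInsert(const std::string& props)
    {
        for (auto& e : m_entries) {
            if (e.props == props) { ++e.hits; return true; }
        }
        if (m_entries.size() == m_capacity) {
            std::size_t victim = 0;
            for (std::size_t i = 1; i < m_entries.size(); ++i) {
                const CacheEntry& e = m_entries[i];
                const CacheEntry& v = m_entries[victim];
                if (e.hits < v.hits || (e.hits == v.hits && e.stamp < v.stamp))
                    victim = i;
            }
            m_entries.erase(m_entries.begin() + victim);
        }
        m_entries.push_back(CacheEntry{props, 0, m_stamp++});
        return false;
    }

private:
    std::size_t             m_capacity;
    unsigned                m_stamp = 0;
    std::vector<CacheEntry> m_entries;
};
```

Because the capacity is fixed, the linear scan is a constant cost per lookup, which matches the "constant penalty, independent of n" property being claimed.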
> > >
> > > > > 4. Mike Nordell's suggestion of string =>
> integer
> > > > > translation. I was
> > > > > thinking of the same thing. Doing a MD5 sum
> might be
> > > > > overkill but
> > > > > something like that would guarantee a unique
> > > > > integer/string key. The
> > > > > integer keys could be stored in a binary
> vector
> > > > > which could be searched
> > > > > at order log(n) time. If we're clever with
> the data
> > > > > structures we might
> > > > > be able to do insertions into the array at
> order
> > > > > log(n) time too.
> > > >
> > > > MD5Sums and integer IDs break down quickly in
> certain
> > > > cases. They're acceptable if we define the
> granularity
> > > > on an appropriate level, and make some other
> > > > hard-and-fast decisions. Currently, all these
> things
> > > > look alike to the PT, and are stored in the
> same
> > > > PP_Attribute:
> > > >
> > > > "font:Arial;size:12pt"
> > > > "size:12pt;font:Arial"
> > > > "font:Arial ;size:12pt"
> > > >
> > >
> > > Currently we treat all these as unique AP's. My
> suggestion for this
> > > technique would continue to treat these as
> different even though, as you
> > > say, they're semantically the same. My GUESS is
> that it is not worth the
> > > penalty to pull out the semantic information and
> order the props
> > > accordingly. However I could be wrong and would
> be
=== message truncated ===
This archive was generated by hypermail 2.1.4 : Fri Sep 26 2003 - 16:07:28 EDT