From: Martin Sevior (msevior@seviorpc.ph.unimelb.edu.au)
Date: Wed Sep 24 2003 - 21:56:08 EDT
On Thu, 2003-09-25 at 03:09, Dom Lachowicz wrote:
> --- Raphael Finkel <raphael@cs.uky.edu> wrote:
> > It sounds like hashing is just the thing you want.
> > Every string gets
> > hashed (can be fast), placed in a hash table
> > (resolving collisions
> > somehow; I prefer external chaining). Checking for
> > equality and
> > insertion in the table each requires about O(1) time
> > (the time depends
> > on the string length, whether your table is big
> > enough, and how unlucky
> > you are about collisions). It scales linearly with
> > document length, not
> > quadratically. It's a well-known technique.
>
> Again - these strings are already in a hash. I've
> tried speeding things up by plunking them out of a
> hash table. In reality, the speed of the hashing
> algorithm + returning the associated piece of data is
> roughly equal to the speed of several strcmps.
>
> People, please stop stabbing wildly and blindly at the
> problem. Look at the code in question. Then post
> things to the ML.
>
I want to apologize for suggesting string compression. It was a crazy
idea.
Yes. I agree we should do a thorough investigation. Here are 3 things I
think we could do.
1. Just #ifdef out the comparison in addIfUnique so we just add the
strings. This provides no penalty for speed at all. The downside is that
the attributes/properties grow linearly with document size and every
time some presses a key.
2. Be selective about addIfUnique. We don't do the ifUnique upon
document loading but keep it during editing. This way the document
doesn't grow any bigger than needed during editing but we keep the fast
load.
3. Have a cache of most recently used Attribute/Properties. We do a
comparison with the properties in the cache and add them if the
properties are not in the cache. This gives us a constant penalty with
some trade-off of document size versus speed.
4. Mike Nordell's suggestion of string => integer translation. I was
thinking of the same thing. Doing a MD5 sum might be overkill but
something like that would guarantee a unique integer/string key. The
integer keys could be stored in a binary vector which could be searched
at order log(n) time. If we're clever with the data structures we might
be able to do insertions into the array at order log(n) time too.
5. Other ideas...
We should really do a thorough investigation of all the options. I
suggest we organize a bug 5290 task force consisting of people who want
to solve the problem. You can help by either writing code to implement a
solution or by testing a proposed solution in terms of speed and memory
consumption.
This thread has generated a lot of interest. I think it would be fun to
try out solutions. Who wants to be involved?
Cheers
Martin
This archive was generated by hypermail 2.1.4 : Wed Sep 24 2003 - 21:09:03 EDT