5291 - A Task Force

From: Martin Sevior (msevior@seviorpc.ph.unimelb.edu.au)
Date: Wed Sep 24 2003 - 21:56:08 EDT

  • Next message: Torstein Sunde: "Enchant - list available dictionaries"

    On Thu, 2003-09-25 at 03:09, Dom Lachowicz wrote:
    > --- Raphael Finkel <raphael@cs.uky.edu> wrote:
    > > It sounds like hashing is just the thing you want.
    > > Every string gets
    > > hashed (can be fast), placed in a hash table
    > > (resolving collisions
    > > somehow; I prefer external chaining). Checking for
    > > equality and
    > > insertion in the table each requires about O(1) time
    > > (the time depends
    > > on the string length, whether your table is big
    > > enough, and how unlucky
    > > you are about collisions). It scales linearly with
    > > document length, not
    > > quadratically. It's a well-known technique.
    >
    > Again - these strings are already in a hash. I've
    > tried speeding things up by plunking them out of a
    > hash table. In reality, the speed of the hashing
    > algorithm + returning the associated piece of data is
    > roughly equal to the speed of several strcmps.
    >
    > People, please stop stabbing wildly and blindly at the
    > problem. Look at the code in question. Then post
    > things to the ML.
    >

    I want to apologize for suggesting string compression. It was a crazy
    idea.

    Yes. I agree we should do a thorough investigation. Here are 3 things I
    think we could do.

    1. Just #ifdef out the comparison in addIfUnique so we just add the
    strings. This provides no penalty for speed at all. The downside is that
    the attributes/properties grow linearly with document size and every
    time some presses a key.

    2. Be selective about addIfUnique. We don't do the ifUnique upon
    document loading but keep it during editing. This way the document
    doesn't grow any bigger than needed during editing but we keep the fast
    load.

    3. Have a cache of most recently used Attribute/Properties. We do a
    comparison with the properties in the cache and add them if the
    properties are not in the cache. This gives us a constant penalty with
    some trade-off of document size versus speed.

    4. Mike Nordell's suggestion of string => integer translation. I was
    thinking of the same thing. Doing a MD5 sum might be overkill but
    something like that would guarantee a unique integer/string key. The
    integer keys could be stored in a binary vector which could be searched
    at order log(n) time. If we're clever with the data structures we might
    be able to do insertions into the array at order log(n) time too.

    5. Other ideas...

    We should really do a thorough investigation of all the options. I
    suggest we organize a bug 5290 task force consisting of people who want
    to solve the problem. You can help by either writing code to implement a
    solution or by testing a proposed solution in terms of speed and memory
    consumption.

    This thread has generated a lot of interest. I think it would be fun to
    try out solutions. Who wants to be involved?

    Cheers

    Martin



    This archive was generated by hypermail 2.1.4 : Wed Sep 24 2003 - 21:09:03 EDT