Re: 5291 - A Task Force

From: Martin Sevior (msevior@seviorpc.ph.unimelb.edu.au)
Date: Thu Sep 25 2003 - 22:14:27 EDT

  • Next message: Martin Sevior: "commit: Basic XHTML table import."

    On Fri, 2003-09-26 at 02:13, Dom Lachowicz wrote:
    > > 1. Just #ifdef out the comparison in addIfUnique so
    > > we just add the
    > > strings. This provides no penalty for speed at all.
    > > The downside is that
    > > the attributes/properties grow linearly with
    > > document size and every
    > > time someone presses a key.
    >
    > I agree with Tomas. This is a bad idea.
    >
    > > 2. Be selective about addIfUnique. We don't do the
    > > ifUnique upon
    > > document loading but keep it during editing. This
    > > way the document
    > > doesn't grow any bigger than needed during editing
    > > but we keep the fast
    > > load.
    >
    > This sounds better than the above, but is probably a
    > bad idea for similar reasons that Tomas gave to the
    > above.

    I have played with this a bit. It has the downside of doing more lookups
    during editing, but the loading is much faster and it is really easy to
    implement. I'm about to commit a patch that implements this as the
    default until we come up with something better. Feel free to flame me
    with quantitative reasons (i.e. some memory usage statistics before and
    after) why we should not go this route.

    (I'm not doing this because I think it's best, only because it's better
    than what we have now.)
    >
    > > 3. Have a cache of most recently used
    > > Attribute/Properties. We do a
    > > comparison with the properties in the cache and add
    > > them if the
    > > properties are not in the cache. This gives us a
    > > constant penalty with
    > > some trade-off of document size versus speed.
    >
    > I think that this is going to hurt us
    > performance-wise. I don't see how one can make the
    > assumption that one is likely to hit duplicate styles
    > within short spans of time. Even so, we'd still
    > need a method to compare the attributes, and a
    > mechanism for passivating parts of the cache. I
    > haven't done any thorough testing on this, but it
    > seems to have all of the above flaws plus some new
    > ones of its own.
    >

    I think a simple buffer with a couple of heuristics could do rather
    well. We could play with the cache size to see how much effect that has.

    The heuristics would be: every time an Att/Prop does not find a match
    in the cache, it replaces an Att/Prop in the cache and gets a new
    indexAP.

    Att/Props in the cache which get matched have their hit count
    incremented by one.

    The oldest Att/Prop with the lowest hit count is the one replaced by
    the new Att/Prop.

    This would be simple and fast with a constant speed penalty (i.e.
    independent of n).

    > > 4. Mike Nordell's suggestion of string => integer
    > > translation. I was
    > > thinking of the same thing. Doing a MD5 sum might be
    > > overkill but
    > > something like that would guarantee a unique
    > > integer/string key. The
    > > integer keys could be stored in a binary vector
    > > which could be searched
    > > at order log(n) time. If we're clever with the data
    > > structures we might
    > > be able to do insertions into the array at order
    > > log(n) time too.
    >
    > MD5Sums and integer IDs break down quickly in certain
    > cases. They're acceptable if we define the granularity
    > on an appropriate level, and make some other
    > hard-and-fast decisions. Currently, all these things
    > look alike to the PT, and are stored in the same
    > PP_Attribute:
    >
    > "font:Arial;size:12pt"
    > "size:12pt;font:Arial"
    > "font:Arial ;size:12pt"
    >

    Currently we treat all these as unique APs. My suggestion for this
    technique would continue to treat these as different even though, as
    you say, they're semantically the same. My GUESS is that it is not
    worth the penalty to pull out the semantic information and order the
    props accordingly. However, I could be wrong and would be happy to see
    some test data showing otherwise.

    > Do we want all of these treated the same?
    > Semantically, they're all identical. As far as strcmp,
    > MD5, etc... goes, they're all radically different
    > strings.
    >

    Yes.

    > If we want to treat these things as different, then
    > using integer IDs is easy. There's no need for MD5sums
    > - 'i++;' will do quite nicely.

    I don't get this. My proposal would be that every new string gets one
    computation to turn it into a unique integer.

    e.g. strings from the importers appear in this order and get
    translated into unique integers:
    "font:Arial;size:12pt" => 2134563234
    "size:12pt;font:Arial" => 3124563234
    "font:Arial ;size:12pt" => 9234567234
    "font:Times New Roman; size:14pt" => 89345673234
    "font:Arial;size:12pt" => 2134563234

    Then instead of comparing strings we simply compare the keyed
    integers. If the integers match we have identical strings. This
    comparison is faster than string matching, and it has the additional
    advantage of allowing the strings to be ordered for a binary search or
    placed in a hash table.

    > However, if we treat
    > these as different, the memory usage will likely go
    > up. By how much (or little) remains to be seen.
    >
    > Finally, these suggestions, while productive, might
    > not be the right place to solve this problem. We have
    > quadratic behavior in our comparison routine. We must
    > perform 47,000,000 comparisons from something that's
    > only called 27,000 times. Yes, strcmp is a little
    > wasteful. Is the solution to replace strcmp with an
    > integer comparison, or to find a solution that doesn't
    > require calling strcmp (or an equivalent integer
    > comparison) 47 million times?
    >

    The above suggestions would all do this. (Well, suggestion 2 would
    only solve the comparison problem during import.)

    > Discuss amongst yourselves.
    >

    We are :-)

    Cheers

    Martin



    This archive was generated by hypermail 2.1.4 : Thu Sep 25 2003 - 21:27:32 EDT