Re: 5291 - A Task Force

From: Dom Lachowicz (domlachowicz@yahoo.com)
Date: Thu Sep 25 2003 - 12:13:01 EDT

  • Next message: Mike Nordell: "Re: 5291 - A Task Force"

    > 1. Just #ifdef out the comparison in addIfUnique so
    > we just add the
    > strings. This provides no penalty for speed at all.
    > The downside is that
    > the attributes/properties grow linearly with
    > document size and every
    > time some presses a key.

    I agree with Tomas. This is a bad idea.
     
    > 2. Be selective about addIfUnique. We don't do the
    > ifUnique upon
    > document loading but keep it during editing. This
    > way the document
    > doesn't grow any bigger than needed during editing
    > but we keep the fast
    > load.

    This sounds better than the above, but is probably a
    bad idea for similar reasons that Tomas gave to the
    above.
     
    > 3. Have a cache of most recently used
    > Attribute/Properties. We do a
    > comparison with the properties in the cache and add
    > them if the
    > properties are not in the cache. This gives us a
    > constant penalty with
    > some trade-off of document size versus speed.

    I think that this is going to hurt us
    performance-wise. I don't see how one can make the
    assumption that one is likely to hit duplicate styles
    within short spans of time. Even still, we'd still
    need a method to compare the attributes, and a
    mechanism for passivating parts of the cache. I
    haven't done any thorough testing on this, but it
    seems to have all of the above flaws plus some new
    ones of its own.

    > 4. Mike Nordell's suggestion of string => integer
    > translation. I was
    > thinking of the same thing. Doing a MD5 sum might be
    > overkill but
    > something like that would guarantee a unique
    > integer/string key. The
    > integer keys could be stored in a binary vector
    > which could be searched
    > at order log(n) time. If we're clever with the data
    > structures we might
    > be able to do insertions into the array at order
    > log(n) time too.

    MD5Sums and integer IDs break down quickly in certain
    cases. They're acceptable if we define the granularity
    on an appropriate level, and make some other
    hard-and-set decisions. Currently, all these things
    look alike to the PT, and are stored in the same
    PP_Attribute:

    "font:Arial;size:12pt"
    "size:12pt;font:Arial"
    "font:Arial ;size:12pt"

    Do we want all of these treated the same?
    Semantically, they're all identical. As far as strcmp,
    MD5, etc... goes, they're all radically different
    strings.

    If we want to treat these things as different, then
    using integer IDs is easy. There's no need for MD5sums
    - 'i++;' will do quite nicely. However, if we treat
    these as different, the memory usage will likely go
    up. By how much (or little) remains to be seen.

    Finally, these suggestions, while productive, might
    not be the right place to solve this problem. We have
    quadratic behavior in our comparison routine. We must
    perform 47,000,000 comparisons from something that's
    only called 27,000 times. Yes, strcmp is a little
    wasteful. Is the solution to replace strcmp with an
    integer comparison, or to find a solution that doesn't
    require calling strcmp (or an equivalent integer
    comparison) 47 million times?

    Discuss amongst yourselves.

    Cheers,
    Dom

    __________________________________
    Do you Yahoo!?
    The New Yahoo! Shopping - with improved product search
    http://shopping.yahoo.com



    This archive was generated by hypermail 2.1.4 : Thu Sep 25 2003 - 12:27:52 EDT