Re: 5291 - A Task Force

From: Mike Nordell (tamlin@algonet.se)
Date: Thu Sep 25 2003 - 12:59:48 EDT

  • Next message: Dom Lachowicz: "Re: 5291 - A Task Force"

    Dom Lachowicz wrote:

    > "font:Arial;size:12pt"
    > "size:12pt;font:Arial"
    > "font:Arial ;size:12pt"
    >
    > Do we want all of these treated the same?

    I'd say: Yes. All two of them should internally be represented in the same
    way, and when written out to a (possibly new) file they should be
    normalized.

    > Semantically, they're all identical. As far as strcmp,
    > MD5, etc... goes, they're all radically different
    > strings.

    Indeed! It also displays a crucial flaw in my idea. Yes, a single 32-bit
    value would still be good to represent such "combined" props, but how to
    represent it?

    Perhaps by adding some more "intelligence" to the function that must exist
    to convert string to integer? Ideally I'd even make that function split up
    e.g. "font" and "size" into different "properties" (internal only for this
    function), but just normalizing them before lookup/insertion could also
    work.

    Another, but less smart I believe, idea could be to hash on each one of
    these (as full strings in their presented form, meaning your example would
    have two forms), and once loading is complete a final collation run could be
    made. Yeah, I told you, it wasn't an especially bright idea even if
    possible.

    Anhother thing to consider could be to use a *very* simple hashing algo for
    the strings instead of going for something as heavy as SHA(1). So what if
    some buckets of the map got more than one entry, it's still be better than
    n^2.

    <snip>
    > We have
    > quadratic behavior in our comparison routine. We must
    > perform 47,000,000 comparisons from something that's
    > only called 27,000 times.

    That can't be right, can it? 27K^2 = 729K, more than an order of magnitude
    worse (like it wasn't bad anough already ;-) ).

    > Yes, strcmp is a little
    > wasteful. Is the solution to replace strcmp with an
    > integer comparison, or to find a solution that doesn't
    > require calling strcmp (or an equivalent integer
    > comparison) 47 million times?

    That's rethoric. :-) What is really want to ask is:
    Do you know of an algorithm to take AW away from this n^2-hell? If so, speak
    up!

    I myself hasn't been in the loop for quite some time, but I do recognize the
    same problems popping up over and over - and this is one of'em. I believe
    trying to remove that n^2 property is a major concern. Should that prove
    (too) difficult, at least going from O=str(i)cmp to O=integer_comparison
    would probably buy a *lot*.

    0.02
    /Mike



    This archive was generated by hypermail 2.1.4 : Thu Sep 25 2003 - 13:14:26 EDT