From: Mike Nordell (tamlin@algonet.se)
Date: Thu Sep 25 2003 - 12:59:48 EDT
Dom Lachowicz wrote:
> "font:Arial;size:12pt"
> "size:12pt;font:Arial"
> "font:Arial ;size:12pt"
>
> Do we want all of these treated the same?
I'd say: Yes. All two of them should internally be represented in the same
way, and when written out to a (possibly new) file they should be
normalized.
> Semantically, they're all identical. As far as strcmp,
> MD5, etc... goes, they're all radically different
> strings.
Indeed! It also displays a crucial flaw in my idea. Yes, a single 32-bit
value would still be good to represent such "combined" props, but how to
represent it?
Perhaps by adding some more "intelligence" to the function that must exist
to convert string to integer? Ideally I'd even make that function split up
e.g. "font" and "size" into different "properties" (internal only for this
function), but just normalizing them before lookup/insertion could also
work.
Another, but less smart I believe, idea could be to hash on each one of
these (as full strings in their presented form, meaning your example would
have two forms), and once loading is complete a final collation run could be
made. Yeah, I told you, it wasn't an especially bright idea even if
possible.
Anhother thing to consider could be to use a *very* simple hashing algo for
the strings instead of going for something as heavy as SHA(1). So what if
some buckets of the map got more than one entry, it's still be better than
n^2.
<snip>
> We have
> quadratic behavior in our comparison routine. We must
> perform 47,000,000 comparisons from something that's
> only called 27,000 times.
That can't be right, can it? 27K^2 = 729K, more than an order of magnitude
worse (like it wasn't bad anough already ;-) ).
> Yes, strcmp is a little
> wasteful. Is the solution to replace strcmp with an
> integer comparison, or to find a solution that doesn't
> require calling strcmp (or an equivalent integer
> comparison) 47 million times?
That's rethoric. :-) What is really want to ask is:
Do you know of an algorithm to take AW away from this n^2-hell? If so, speak
up!
I myself hasn't been in the loop for quite some time, but I do recognize the
same problems popping up over and over - and this is one of'em. I believe
trying to remove that n^2 property is a major concern. Should that prove
(too) difficult, at least going from O=str(i)cmp to O=integer_comparison
would probably buy a *lot*.
0.02
/Mike
This archive was generated by hypermail 2.1.4 : Thu Sep 25 2003 - 13:14:26 EDT