Re: 5291 - A Task Force

From: Dom Lachowicz (domlachowicz@yahoo.com)
Date: Thu Sep 25 2003 - 13:49:16 EDT

Next message: Dom Lachowicz: "Re: 5291 - A Task Force"

Previous message: Mike Nordell: "Re: 5291 - A Task Force"
In reply to: Mike Nordell: "Re: 5291 - A Task Force"
Next in thread: Dom Lachowicz: "Re: 5291 - A Task Force"
Next in thread: Ted Lemon: "Re: 5291 RTF import slow - the solution :-)"
Reply: Dom Lachowicz: "Re: 5291 - A Task Force"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]

Hi Mike,

> I'd say: Yes. All two of them should internally be
> represented in the same
> way, and when written out to a (possibly new) file
> they should be
> normalized.

Some pre-processing normalization might help here.
Like stripping out extraneous spaces, ordering props
alphabetically, etc... Then again, it might be
expensive to do such a thing.

> Indeed! It also displays a crucial flaw in my idea.
> Yes, a single 32-bit
> value would still be good to represent such
> "combined" props, but how to
> represent it?

See above for some proposed "intelligence". However,
you need to make sure that your ordering + integer
assignment phase is faster than what we're doing now.
This isn't clear to me to be the case (though it may
very well be).

Since we've got 27000 add/store cases, that will
result in 27000 re-orderings, plus 47,000,000 integer
comparisons. The cost of handing out a new integer
will be constant-time (i++). It all hinges on how
expensive the reordering phase is. A quick test could
be to make use of FJF's new string map class which
will take a CSS props block and then you sort its
contents.

> > We have
> > quadratic behavior in our comparison routine. We
> must
> > perform 47,000,000 comparisons from something
> that's
> > only called 27,000 times.
>
> That can't be right, can it? 27K^2 = 729K, more than
> an order of magnitude
> worse (like it wasn't bad anough already ;-) ).

O(n^2), not n^2. What we have is a situation like
this:

For the first item, the list is empty. Insert it. For
the second item, you must do 1 comparison to see if
it's a duplicate. For the second, you must do (on
average) 1, but maybe 2. You'll see that you have an
alogorithm that looks like this for an "average" case
where you're as likely to find something in the first
half as you are in the second half:

27000/2 + 26999/2 + 26998/2 + ...
or
(n/2) + ((n-1)/2) + ((n-2)/2) + ...

So you end up having things that are roughly (n *
(n/2)) operations, or O(n^2) behavior. In any case,
it's bad.

> > Yes, strcmp is a little
> > wasteful. Is the solution to replace strcmp with
> an
> > integer comparison, or to find a solution that
> doesn't
> > require calling strcmp (or an equivalent integer
> > comparison) 47 million times?
>
> That's rethoric. :-) What is really want to ask is:
> Do you know of an algorithm to take AW away from
> this n^2-hell? If so, speak
> up!

Not so much rhetoric as me being genuinely confused as
to a good course of action. If we can do better than
O(n^2), that's great. I'm not sure we can. But I
haven't given it enough thought.

If that fials, we might turn that strcmp into a if()
statement. That _might_ be a big boost too. That is,
*assuming that normalization of properties and
assignment of integer ids proves fast*. I mean fast
here in terms of both BIG-O complexity, and in
wall-clock times, with a deference given to wall-clock
times since our users really are looking at their
watches at this point.

> I myself hasn't been in the loop for quite some
> time, but I do recognize the
> same problems popping up over and over - and this is
> one of'em. I believe
> trying to remove that n^2 property is a major
> concern. Should that prove
> (too) difficult, at least going from O=str(i)cmp to
> O=integer_comparison
> would probably buy a *lot*.

It might. It's worth prototyping at least. If each by
itself proves worthwhile, combining the two would
probably be an even bigger win.

Dom

__________________________________
Do you Yahoo!?
The New Yahoo! Shopping - with improved product search
http://shopping.yahoo.com

Next message: Dom Lachowicz: "Re: 5291 - A Task Force"
Previous message: Mike Nordell: "Re: 5291 - A Task Force"
In reply to: Mike Nordell: "Re: 5291 - A Task Force"
Next in thread: Dom Lachowicz: "Re: 5291 - A Task Force"
Next in thread: Ted Lemon: "Re: 5291 RTF import slow - the solution :-)"
Reply: Dom Lachowicz: "Re: 5291 - A Task Force"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]

This archive was generated by hypermail 2.1.4 : Thu Sep 25 2003 - 14:04:37 EDT