From: Dom Lachowicz (domlachowicz@yahoo.com)
Date: Thu Sep 25 2003 - 12:13:01 EDT
> 1. Just #ifdef out the comparison in addIfUnique so
> we just add the
> strings. This provides no penalty for speed at all.
> The downside is that
> the attributes/properties grow linearly with
> document size and every
> time some presses a key.
I agree with Tomas. This is a bad idea.
> 2. Be selective about addIfUnique. We don't do the
> ifUnique upon
> document loading but keep it during editing. This
> way the document
> doesn't grow any bigger than needed during editing
> but we keep the fast
> load.
This sounds better than the above, but is probably a
bad idea for similar reasons that Tomas gave to the
above.
> 3. Have a cache of most recently used
> Attribute/Properties. We do a
> comparison with the properties in the cache and add
> them if the
> properties are not in the cache. This gives us a
> constant penalty with
> some trade-off of document size versus speed.
I think that this is going to hurt us
performance-wise. I don't see how one can make the
assumption that one is likely to hit duplicate styles
within short spans of time. Even still, we'd still
need a method to compare the attributes, and a
mechanism for passivating parts of the cache. I
haven't done any thorough testing on this, but it
seems to have all of the above flaws plus some new
ones of its own.
> 4. Mike Nordell's suggestion of string => integer
> translation. I was
> thinking of the same thing. Doing a MD5 sum might be
> overkill but
> something like that would guarantee a unique
> integer/string key. The
> integer keys could be stored in a binary vector
> which could be searched
> at order log(n) time. If we're clever with the data
> structures we might
> be able to do insertions into the array at order
> log(n) time too.
MD5Sums and integer IDs break down quickly in certain
cases. They're acceptable if we define the granularity
on an appropriate level, and make some other
hard-and-set decisions. Currently, all these things
look alike to the PT, and are stored in the same
PP_Attribute:
"font:Arial;size:12pt"
"size:12pt;font:Arial"
"font:Arial ;size:12pt"
Do we want all of these treated the same?
Semantically, they're all identical. As far as strcmp,
MD5, etc... goes, they're all radically different
strings.
If we want to treat these things as different, then
using integer IDs is easy. There's no need for MD5sums
- 'i++;' will do quite nicely. However, if we treat
these as different, the memory usage will likely go
up. By how much (or little) remains to be seen.
Finally, these suggestions, while productive, might
not be the right place to solve this problem. We have
quadratic behavior in our comparison routine. We must
perform 47,000,000 comparisons from something that's
only called 27,000 times. Yes, strcmp is a little
wasteful. Is the solution to replace strcmp with an
integer comparison, or to find a solution that doesn't
require calling strcmp (or an equivalent integer
comparison) 47 million times?
Discuss amongst yourselves.
Cheers,
Dom
__________________________________
Do you Yahoo!?
The New Yahoo! Shopping - with improved product search
http://shopping.yahoo.com
This archive was generated by hypermail 2.1.4 : Thu Sep 25 2003 - 12:27:52 EDT