Re: 5291 RTF import slow - the solution :-)

From: msevior@physics.unimelb.edu.au
Date: Wed Sep 24 2003 - 07:13:37 EDT

    >
    > The present system has two real weaknesses; one is the need to do
    > string comparisons when comparing properties, and the other is the fact
    > that we have to clone the string values when making copies,
    > passing properties through functions, etc. If the properties were stored
    > internally in a numerical format both of these would go away. I think
    > Hub's proposal should be considered.
    >
    > Tomas

    The problem is in pt_VarSet::addIfUniqueAP(...), in pt_VarSet.cpp.

    If instead of requiring each index to be unique we just add every property
    set we get a dramatic increase in loading speed at the cost of extra
    memory usage.

    What is happening is that almost every cell in that huge 3000 cell table
    will give a unique indexAP since every one has different properties. But
    we don't know that until we've finished scanning the entire Att/Prop set.
    After that we add another.

    So load time grows quadratically with document size.

    However, if we put in the following ifdef....

    bool pt_VarSet::addIfUniqueAP(PP_AttrProp * pAP, PT_AttrPropIndex * papi)
    {
            // Add the AP to our tables iff it is unique.
            // If not unique, delete it and return the index
            // of the one that matches. If it is unique, add
            // it and return the index where we added it.
            // return false if we have any errors.

            UT_ASSERT(pAP && papi);
            UT_uint32 subscript = 0;
    #if 0
            UT_uint32 table = 0;

            for (table=0; table<2; table++)
                    if (m_tableAttrProp[table].findMatch(pAP,&subscript))
                    {
                            // use the one that we already have in the table.
                            delete pAP;
                            *papi = _makeAPIndex(table,subscript);
                            return true;
                    }

            // we did not find a match, so we store our new one.
    #endif
            if (m_tableAttrProp[m_currentVarSet].addAP(pAP,&subscript))
            {
                    *papi = _makeAPIndex(m_currentVarSet,subscript);
                    return true;
            }

            // memory error of some kind.
            UT_ASSERT(UT_SHOULD_NOT_HAPPEN);
            delete pAP;
            return false;
    }
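
    To make the trade-off concrete, here is a minimal standalone sketch of
    the three ways a table could accept a new property set: scan for a match
    first, append unconditionally (what the #if 0 does), or look up through a
    hash index. ApTable, addLinear, addAlways, addHashed and
    countLinearComparisons are hypothetical names for illustration only; each
    property set is reduced to a single string, and the hashed variant is an
    alternative nobody has proposed in this thread, not AbiWord code.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for an attribute/property table. Each property set
// is reduced to one string; this is not AbiWord's real PP_AttrProp storage.
// A real table would use exactly one of the three add* strategies.
struct ApTable {
    std::vector<std::string> items;
    std::unordered_map<std::string, std::size_t> index; // used by addHashed only
    std::size_t comparisons = 0;                        // counted by addLinear only

    // Scan every stored entry before appending: O(n) work per call.
    std::size_t addLinear(const std::string& ap) {
        for (std::size_t i = 0; i < items.size(); ++i) {
            ++comparisons;
            if (items[i] == ap) return i;   // reuse the entry we already have
        }
        items.push_back(ap);
        return items.size() - 1;
    }

    // Append unconditionally: O(1) per call, but duplicates are stored.
    std::size_t addAlways(const std::string& ap) {
        items.push_back(ap);
        return items.size() - 1;
    }

    // Look up through a hash index: O(1) on average, no duplicates stored.
    std::size_t addHashed(const std::string& ap) {
        auto it = index.find(ap);
        if (it != index.end()) return it->second;
        items.push_back(ap);
        index.emplace(ap, items.size() - 1);
        return items.size() - 1;
    }
};

// Count the string comparisons a linear-scan table makes when it is fed
// n property sets that are all different from one another.
std::size_t countLinearComparisons(std::size_t n) {
    ApTable t;
    for (std::size_t i = 0; i < n; ++i)
        t.addLinear("cell-" + std::to_string(i));
    return t.comparisons;
}
```

    With 3000 distinct property sets, countLinearComparisons(3000) performs
    3000*2999/2 = 4,498,500 comparisons, which is the quadratic cost described
    above. addAlways avoids all of them but keeps every duplicate, which is
    where the extra memory goes.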

    The document in bug 5290 loads into the piecetable in 8 seconds on my 1
    GHz laptop as opposed to 40 seconds without the ifdef.

    The full document load (until layout stops filling) takes 24 seconds with
    the ifdef 0 as opposed to 55 seconds without it.

    By comparison, MS Word 2000 running under Wine takes 4 seconds to load its
    piecetable and a total of 11 seconds to fully load the document
    (when its layout structures stop filling).

    Sorry I don't have Open Office on this machine for another comparison.

    I've done some quick tests on my machine and AbiWord still works fine with
    the ifdef 0. I guess we should do extensive tests as well as memory usage
    comparisons before and after the #ifdef.

    My guess is that the ifdef is worth it. The memory usage will grow
    linearly with document size as it does now from the storing of text and
    building layout structures.

    If we wish to save on memory we could always store the strings gzipped.

    So everyone, should we put in the #ifdef 0 code?

    Cheers

    Martin


