more hints (was Re: Using of system-wide ispell dictionaries)

From: Paul Rohr (paul@abisource.com)
Date: Mon Jun 03 2002 - 12:01:31 EDT

    At 08:32 AM 6/3/02 -0700, I wrote:
    >However, I'd also like to point out that this entire category of problems
    >could be solved *permanently* by a sufficiently-clever programmer who's
    >willing to:
    >
    > Do the ugly, thankless work needed to allow us to load *any*
    > commonly-available variant of the ispell hash formats, instead
    > of just one.
    >
    >For some hints on the work required to do so, see:
    >
    > http://www.abisource.com/mailinglists/abiword-dev/01/April/1030.html
    > http://www.abisource.com/mailinglists/abiword-dev/01/March/0769.html
    >
    >To date, we haven't found any volunteers who are both brave enough and
    >talented enough to tackle this, but I'm still hopeful that we will. :-)

    Bummer. I just Googled myself *after* hitting send -- doh! -- and realized
    I'd missed the following hints:

      http://www.abisource.com/mailinglists/abiword-dev/01/May/0251.html
      http://www.abisource.com/mailinglists/abiword-dev/01/May/0151.html

    As alluded to in that thread, there are (at least) three factors that affect
    the variability of ispell hash file formats:

      bits/flags (aka MASKBITS)/characters

    The specific permutations found in the wild tend to vary a lot -- some
    distros ship 8/56/128 hashes, others ship 7/26/100, and so on.

    Insofar as we *already* have code which allows for variability in the
    *middle* of these -- we handle 8/N/100 hashes (for N <= 64) -- my
    suggestion is essentially to add code for even more flexibility here. For
    instance, we could probably get very far if we could handle any of the
    following permutations:

      B/N/C (for, say, B = 7|8; N <= 64; C <= 128)

    The key insight here remains:

      1. ispell hashes define a family of *very* closely-related file formats.
      2. The variances are simply different values of known #defines.
      3. These variances change the widths of key structs on disk.
      4. The file format includes sanity check fields which explicitly tell
          you which #defines were used.
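
    To make #3 concrete, here is a minimal Python sketch of how the N and C
    values could change on-disk widths. The entry layout (a 32-bit string
    offset followed by the flag mask packed into 32-bit words), the
    character table of one byte per character, and the helper names are all
    assumptions for illustration, not the actual ispell structs.

```python
import struct


def entry_format(maskbits: int) -> str:
    """struct format for one HYPOTHETICAL dictionary entry.

    Assumed layout: a 32-bit offset, then enough 32-bit words to hold
    `maskbits` flag bits. Little-endian, no padding.
    """
    mask_words = (maskbits + 31) // 32  # round up to whole 32-bit words
    return f"<I{mask_words}I"


def table_format(chars: int) -> str:
    """struct format for a HYPOTHETICAL per-character table of C bytes."""
    return f"<{chars}B"


# The same logical record occupies a different number of bytes
# depending on N, and the table depends on C.
for n in (26, 56):
    print(n, struct.calcsize(entry_format(n)))  # 26 -> 8 bytes, 56 -> 12 bytes
for c in (100, 128):
    print(c, struct.calcsize(table_format(c)))  # 100 -> 100 bytes, 128 -> 128 bytes
```

    A struct compiled for one N therefore cannot be copied blindly over a
    file written with another, but the widths are easy to compute once N
    and C are known.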

    In short, #1-4 provide all the information needed for a smart hashfile
    loader to determine which variant of the file format is being read. The
    problem with all legacy ispell implementations is that they do something
    incredibly dumb at this point:

      5. Try to do a struct copy (!!) from disk to memory using a specific
          permutation of #defines.

      6. *Recognize* the existence of other valid #define permutations...
          and refuse to load those files at all!
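
    In contrast to #5-6, a loader could read the sanity-check fields first
    and then size its reads to match. The Python sketch below assumes a
    hypothetical header (magic, B, N, C, entry count) and a hypothetical
    entry layout; the real ispell header has different fields, MAGIC is a
    placeholder value, and load_entries is an illustrative name.

```python
import struct

# HYPOTHETICAL header: magic, bits (B), maskbits (N), chars (C) as
# unsigned 16-bit fields, then a 32-bit entry count. Not the real layout.
HEADER = struct.Struct("<4HI")
MAGIC = 0x1234  # placeholder, not the real ispell magic number

SUPPORTED_BITS = (7, 8)
MAX_MASKBITS = 64
MAX_CHARS = 128


def load_entries(blob: bytes):
    """Parse a hash blob of any supported B/N/C variant.

    Returns ((bits, maskbits, chars), [(offset, flag_mask), ...]).
    """
    magic, bits, maskbits, chars, count = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise ValueError("not a recognized hash file")
    if (bits not in SUPPORTED_BITS
            or not 0 < maskbits <= MAX_MASKBITS
            or not 0 < chars <= MAX_CHARS):
        raise ValueError(f"unsupported variant {bits}/{maskbits}/{chars}")

    # Build the entry layout from the values found in the file,
    # instead of from compile-time constants.
    mask_words = (maskbits + 31) // 32
    entry = struct.Struct(f"<I{mask_words}I")

    entries = []
    pos = HEADER.size
    for _ in range(count):
        offset, *words = entry.unpack_from(blob, pos)
        # Widen the on-disk mask words into one integer in memory.
        mask = 0
        for i, word in enumerate(words):
            mask |= word << (32 * i)
        entries.append((offset, mask))
        pos += entry.size
    return (bits, maskbits, chars), entries
```

    The point of the design is that the in-memory representation is fixed
    (here, one integer per flag mask) while the on-disk width is decided
    per file, so an 8/56/128 hash and a 7/26/100 hash go through the same
    code path.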

    I claim that it's much easier to do something sufficiently smart here than
    it is to, say, reverse-engineer the family of Word binary file formats.

    ;-)

    Paul,
    design evangelist


