more hints (was Re: Using of system-wide ispell dictionaries)

From: Paul Rohr (paul@abisource.com)
Date: Mon Jun 03 2002 - 12:01:31 EDT

Next message: Alan Horkan: "Re: Using of system-wide ispell dictionaries"

Previous message: Paul Rohr: "Re: Using of system-wide ispell dictionaries"
In reply to: Paul Rohr: "Re: Using of system-wide ispell dictionaries"
Next in thread: Alan Horkan: "Re: Using of system-wide ispell dictionaries"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]

At 08:32 AM 6/3/02 -0700, I wrote:
>However, I'd also like to point out that this entire category of problems
>could be solved *permanently* by a sufficiently-clever programmer who's
>willing to:
>
> Do the ugly, thankless work needed to allow us to load *any*
> commonly-available variant of the ispell hash formats, instead
> of just one.
>
>For some hints on the work required to do so, see:
>
> http://www.abisource.com/mailinglists/abiword-dev/01/April/1030.html
> http://www.abisource.com/mailinglists/abiword-dev/01/March/0769.html
>
>To date, we haven't found any volunteers who are both brave enough and
>talented enough to tackle this, but I'm still hopeful that we will. :-)

Bummer. I just Googled myself *after* hitting send -- doh! -- and realized
I'd missed the following hints:

http://www.abisource.com/mailinglists/abiword-dev/01/May/0251.html
http://www.abisource.com/mailinglists/abiword-dev/01/May/0151.html

As alluded to in that thread, there are (at least) three factors that affect
the variability of ispell hash file formats:

bits/flags (aka MASKBITS)/characters

The specific permutations found in the wild tend to vary a lot -- some
distros ship 8/56/128 hashes, others ship 7/26/100, and so on.

Insofar as we *already* have code which allows for variability in the
*middle* of these -- we handle 8/N/100 hashes (for N <= 64) -- my suggestion
is essentially to add code for even more flexibility here. For instance, we
could probably get very far if we could handle any of the following
permutations:

B/N/C (for, say, B = 7|8; N <= 64; C <= 128)
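
As a rough sketch of what "handle any of these permutations" means, the
range above can be written as a simple check. This is Python for brevity,
and the names (HashVariant, is_supported) are made up for illustration;
nothing here comes from the ispell or AbiWord sources.

```python
from typing import NamedTuple


class HashVariant(NamedTuple):
    """One bits/flags/characters permutation (hypothetical helper type)."""
    bits: int      # B
    maskbits: int  # N, aka MASKBITS
    chars: int     # C


def is_supported(v: HashVariant) -> bool:
    # Ranges taken from the suggestion above: B = 7|8; N <= 64; C <= 128.
    return v.bits in (7, 8) and 0 < v.maskbits <= 64 and 0 < v.chars <= 128


print(is_supported(HashVariant(8, 56, 128)))  # True
print(is_supported(HashVariant(7, 26, 100)))  # True
```

Both permutations mentioned earlier as "found in the wild" fall inside
this range.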

The key insight here remains:

  1. ispell hashes define a family of *very* closely-related file formats.
  2. The variances are simply different values of known #defines.
  3. These variances change the widths of key structs on disk.
  4. The file format includes sanity check fields which explicitly tell
      you which #defines were used.
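
To make points 2-4 concrete, here is a minimal sketch of a loader that
reads the variant out of the file itself before touching anything else.
The header layout is an assumption invented for this example (three
little-endian 16-bit fields); the real ispell hash header has more
fields and a different arrangement.

```python
import struct

# HYPOTHETICAL header layout, for illustration only:
# bits, maskbits, chars as three little-endian unsigned 16-bit values.
HEADER = struct.Struct("<HHH")


def sniff_variant(blob: bytes) -> tuple[int, int, int]:
    """Return (bits, maskbits, chars) as recorded in the file's header."""
    if len(blob) < HEADER.size:
        raise ValueError("file too short to contain a header")
    return HEADER.unpack_from(blob, 0)


sample = HEADER.pack(7, 26, 100) + b"...payload..."
print(sniff_variant(sample))  # (7, 26, 100)
```

The point is only that the loader asks the file which #defines were
used, instead of assuming them at compile time.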

In short, #1-4 provide all the information needed for a smart hashfile
loader to determine which variant of the file format is being read. The
problem with all legacy ispell implementations is that they do something
incredibly dumb at this point:

  5. Try to do a struct copy (!!) from disk to memory using a specific
      permutation of #defines.

  6. *Recognize* the existence of other valid #define permutations...
      and refuse to load those files at all!
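
For contrast, a smarter loader would decode each record field by field,
with widths chosen at run time. A minimal sketch, assuming a toy record
format (a 4-byte offset followed by a flag mask just wide enough for the
file's MASKBITS); the record layout and the names entry_struct and
read_entries are hypothetical, not the real ispell structs:

```python
import struct


def entry_struct(maskbits: int) -> struct.Struct:
    """Decoder for one toy entry: 4-byte offset + flag mask of the
    width implied by maskbits (rounded up to whole bytes)."""
    mask_bytes = (maskbits + 7) // 8
    return struct.Struct(f"<I{mask_bytes}s")


def read_entries(blob: bytes, maskbits: int):
    """Yield (offset, flags) pairs, ignoring any trailing partial record."""
    rec = entry_struct(maskbits)
    usable = len(blob) - len(blob) % rec.size
    for offset, mask in rec.iter_unpack(blob[:usable]):
        yield offset, int.from_bytes(mask, "little")


# The same code path copes with records of different on-disk widths.
print(entry_struct(26).size, entry_struct(56).size)  # 8 11
```

Because the record width is computed from the value found in the file,
there's no single compiled-in struct to mismatch against.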

I claim that it's much easier to do something sufficiently smart here than
it is to, say, reverse-engineer the family of Word binary file formats.

;-)

Paul,
design evangelist

