Re: AbiSource, Ispell, Aspell and beyond...


Subject: Re: AbiSource, Ispell, Aspell and beyond...
From: Kevin Atkinson (kevinatk@home.com)
Date: Fri Feb 25 2000 - 21:09:18 CST


On Fri, 25 Feb 2000, Paul Rohr wrote:

> At 04:54 AM 2/23/00 -0500, Kevin Atkinson wrote:
> >> Aspell needs more info than a simple minded spell checker to do its job
> >> as well as it can, like communicating back the replacement pairs.
> >
> >This is needed becuase aspell is able to learn from users
> >missspellings. For example on the first pass a user misspelles
> >beginning as beging so aspell suggests:
> >
> > begging, begin, being, Beijing, bagging, ....
> >
> >However the user then tries "begning" and aspell suggests
> >
> > beginning, beaning, begging, ...
> >
> >so the user selects beginning. However than, latter on in the document
> >the user misspelles it as begng (NOT beging). Normally aspell will
> >suggest.
> >
> > began, begging, begin, begun, ....
> >
> >However becuase it knows the user mispelled beginning as beging it will
> >instead suggest:
> >
> > beginning, began, begging, begin, begun ...
> >
> >BTW I often misspelled beginning (and still do) as something close to
> >begging and two many times wind up writing sentences such as "begging
> >with ....".
>
> Hmm. That's an interesting approach. I hadn't thought of feeding the
> corrections chosen back into the engine to help improve the quality of
> future suggestions. I guess I'm a good enough speller already that I've
> never needed adaptive algorithms like that.
>
> Your examples still make it sound a bit like black magic, though. :-)
>

Here is how it is done.

6. How Aspell Works

The magic behind my spell checker comes from merging Lawrence Philips
excellent metaphone algorithm and Ispell's near miss strategy which is
inserting a space or hyphen, interchanging two adjacent letters, changing
one letter, deleting a letter, or adding a letter.

The process goes something like this.

 1. Convert the misspelled word to its soundslike equivalent (its metaphone
    for English words).
 2. Find all words that have a soundslike within one or two edit distances
    from the original words soundslike. The edit distance is the total
    number of deletions, insertions, exchanges, or adjacent swaps needed to
    make one string equivalent to the other. When set to only look for
    soundslikes within one edit distance it tries all possible soundslike
    combinations and check if each one is in the dictionary. When set to
    find all soundslike within two edit distance it scans through the
    entire dictionary and quickly scores each soundslike. The scoring is
    quick because it will give up if the two soundslikes are more than two
    edit distances apart.
 3. Find misspelled words that have a correctly spelled replacement by the
    same criteria of step number 2 and 3. That is the misspelled word in
    the word pair (such as teh -> the) would appear in the suggestions list
    as if it was a correct spelling.
 4. Score the result list and return the words with the lowest score. The
    score is roughly the weighed average of the weighed edit distance of
    the word to the misspelled word and the soundslike equivalent of the
    two words. The weighted edit distance is like the edit distance except
    that the various edits have weights attached to them.
 5. Replace the misspelled words that have correctly spelled replacements
    with their replacements and remove any duplicates that might arise
    because of this.

Please note that the soundslike equivalent is a rough approximation of how
the words sounds. It is not the phoneme of the word by any means. For more
details about exactly how each step is performed please see the file
suggest.cc. For more information on the metaphone algorithm please see the
data file english_phonet.dat.

> >(Yes its nasty C++ but it should be easy to uglify it
> >to your portably standards. -)
>
> Kevin, do I sound this snide to you? If so, please let me know. I really
> don't want to come across as a totally grumpy guy.
>

No, I just don't like the portability standards. It has NOTHING to do with
you but AbiSource, Mozilla, etc. viewpoint on how using such a low level
of C++. I understand your reasons however I gcc has come along quite a
bit sense the standards were originally advised and gcc will compile on
almost all platforms except the mac.....

> >So
> >the point is if you REALLY want a device dependent dictionary format you
> >can provide aspell with one. However as I said before I no NOT think
> >this is the right path to chose. Compiling the word list is a VERY easy
> >task in aspell and moderately easy in ispell.
>
> I think we're in violent agreement here. Having device-dependent dictionary
> formats is a Bad Thing, right?

Oups. device independent. I am not violently against device independent I
just think you put two much value in it.

> I don't think *any* of us really understand the existing ispell code well
> enough to help you isolate specific algorithms from it. A bunch of us have
> whittled around the edges on an as-needed basis, and then got out of there
> as fast as we could.

Ok, than the more I try to poke at ispell the more I think this is a bad
idea.

---
Kevin Atkinson
kevinatk@home.com
http://metalab.unc.edu/kevina/



This archive was generated by hypermail 2b25 : Fri Feb 25 2000 - 21:06:34 CST