AbiSource, Ispell, Aspell and beyond...


Subject: AbiSource, Ispell, Aspell and beyond...
From: Kevin Atkinson (kevinatk@home.com)
Date: Sat Feb 19 2000 - 00:23:23 CST


Hi there!

For those of you who don't know, I am the developer of Aspell
(http://aspell.sourceforge.net), the spell checker that comes up with
MUCH better suggestions than ispell -- or just about any other spell
checker I have seen -- but uses templates, which are taboo according to
Darren O. Benham. Since integrating Aspell was out of the question,
Justin Bradford was contemplating using the basic Aspell algorithm.

Well, I strongly recommend against it. The algorithm is simple enough, but
that is only half of the picture. The other half is the fine tuning
of parameters and the constant adjustment of the metaphone code, the
scoring algorithms, and the strategy used to get the other metaphones in
the first place, among other things.

After using abiword and looking at the source it seems that instead of
using ispell through the pipe interface you decided to integrate ispell
directly into your program. This currently rules out using aspell at
ALL with abiword. However, it seems that this is only a temporary thing
and, at least according to Justin Bradford, you plan to come up with a
more generic interface where other spell checkers can be
used... including Aspell.

So I plan on coming up with such an interface as a generic spell
checker library. The interface will be a totally separate library which
will not be a part of aspell in any way. The library alone won't be
able to do anything without supporting modules. I plan on developing
the aspell module and am hoping someone else will develop the ispell
module, as the ispell source is very messy and I am not well trained
at disentangling messy C code. AbiSource will simply link to the
generic library and the library will decide what spell checker to
use based on the user's preference.
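
A minimal sketch of how such a library might pick a module from the user's preference. Everything in it (the Checker interface, ModuleRegistry, DummyChecker, the module names) is hypothetical and is not part of the draft attached below; it is written in present-day C++ for brevity.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>

// Hypothetical interface every spell checker module would implement.
class Checker {
public:
  virtual ~Checker() {}
  virtual std::string name() const = 0;
  virtual bool check(const std::string & word) const = 0;
};

// Hypothetical registry: modules register a factory under a name and the
// library picks one based on the user's preference.
class ModuleRegistry {
public:
  typedef std::function<std::unique_ptr<Checker>()> Factory;

  void add(const std::string & name, Factory f) { factories_[name] = f; }

  // Returns the preferred module if it is registered, otherwise the
  // fallback, otherwise a null pointer.
  std::unique_ptr<Checker> create(const std::string & preferred,
                                  const std::string & fallback) const {
    auto it = factories_.find(preferred);
    if (it == factories_.end()) it = factories_.find(fallback);
    if (it == factories_.end()) return nullptr;
    return it->second();
  }

private:
  std::map<std::string, Factory> factories_;
};

// Stand-in module used only for illustration; it accepts every word.
class DummyChecker : public Checker {
public:
  explicit DummyChecker(const std::string & n) : name_(n) {}
  std::string name() const override { return name_; }
  bool check(const std::string &) const override { return true; }
private:
  std::string name_;
};
```

The application only ever talks to Checker, so adding a new module would not require changes on the application side.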

My eventual goal for this library is to get all programs which use
ispell through a pipe, merge ispell into their product, or come up
with their own spell checker to use the generic interface. Because it
is not AbiSource specific I will not listen to suggestions to make
the code follow AbiSource coding standards except for ones involving
portability.

I attached a draft outline of what the interface will look like.

Please look it over and let me know what you think of the idea, the
interface, etc...

I also attached the relevant parts of the conversation on spell
checking in AbiSource which I found in the devel list archive. Let me
know if I missed anything important.

One last note. What is the big deal with the endian problem? Ispell
hash tables are machine specific just like the machine code is. Thus,
the hash table should be compiled just as the source code would be.
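
To make the point concrete, here is a small sketch of why a table dumped straight from memory is machine specific, and what a byte-order-independent encoding looks like by comparison. It does not reproduce ispell's actual hash file layout; the function names are made up for illustration.

```cpp
#include <cstdint>
#include <cstring>

// True if this machine stores the least significant byte first.
inline bool is_little_endian() {
  const std::uint32_t one = 1;
  unsigned char first;
  std::memcpy(&first, &one, 1);
  return first == 1;
}

// Machine specific: copies the value's in-memory bytes as they are.
inline void write_native(std::uint32_t v, unsigned char out[4]) {
  std::memcpy(out, &v, 4);
}

// Portable: always writes the most significant byte first.
inline void write_big_endian(std::uint32_t v, unsigned char out[4]) {
  out[0] = static_cast<unsigned char>(v >> 24);
  out[1] = static_cast<unsigned char>(v >> 16);
  out[2] = static_cast<unsigned char>(v >> 8);
  out[3] = static_cast<unsigned char>(v);
}

// Reads back a value written by write_big_endian on any machine.
inline std::uint32_t read_big_endian(const unsigned char in[4]) {
  return (std::uint32_t(in[0]) << 24) | (std::uint32_t(in[1]) << 16) |
         (std::uint32_t(in[2]) << 8)  |  std::uint32_t(in[3]);
}
```

A file written with write_native can only be read back reliably on a machine with the same byte order, which is why building the hash table on each target is one way around the problem.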

An alpha version of the generic spell checker library as well as the
aspell module should be available within a month or so.

-
Kevin Atkinson
kevinatk@home.com
http://metalab.unc.edu/kevina/

// All class names will be prefixed by the name of my portable library
// or an abbreviation of it.

class Object {
public:
  virtual Object * clone() const = 0;
  // if the two objects are not of the exact same type
  // the assign method is undefined.
  virtual void assign(const Object *) = 0;
  virtual int error_num() {return 0;}
  virtual const char * error_message() {return "";}
  // string valid until the next error
  virtual ~Object() {}
};

// An emulation is an efficient way to iterate through elements, much
// like a forward iterator. The at_end method is a convenience method,
// as emulations will return a null pointer when they are at the end.
// Unlike with an iterator, iterating through x elements of a list can
// be done in x function calls, while an iterator will require 3*x
// function calls.
// Example of emulation usage:
//   const char * word;
//   while ( (word = elements->next()) != 0) { // one function call
//     cout << word << endl;
//   }
// And an iterator:
//   iterator i = container->begin();
//   iterator end = container->end();
//   while (i != end) {     // comparison, one function call
//     cout << *i << endl;  // deref, total of two function calls
//     ++i;                 // increment, total of three function calls
//   }
// Normally all three calls are inline so it doesn't really matter,
// but when the container type is not visible they are not inline
// and are probably even virtual.
// If you really love iterators you can very easily wrap an emulation
// in a forward iterator.
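
// A minimal sketch (not part of the original draft) of such a wrapper.
// StringEmulation here is a simplified stand-in without the Object base,
// and VectorEmulation and EmulationIterator are hypothetical names.
// Because all copies share one emulation, this is really a single-pass
// (input) iterator; a true forward iterator would presumably need clone().

```cpp
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Simplified stand-in for the draft's StringEmulation (no Object base).
class StringEmulation {
public:
  virtual ~StringEmulation() {}
  virtual const char * next() = 0;   // null pointer at the end
  virtual bool at_end() const = 0;
};

// Hypothetical concrete emulation over a vector of strings.
class VectorEmulation : public StringEmulation {
public:
  explicit VectorEmulation(const std::vector<std::string> & v)
    : v_(v), i_(0) {}
  const char * next() { return i_ < v_.size() ? v_[i_++].c_str() : 0; }
  bool at_end() const { return i_ == v_.size(); }
private:
  const std::vector<std::string> & v_;
  std::size_t i_;
};

// Single-pass iterator wrapper: one virtual call per step.
class EmulationIterator {
public:
  typedef std::input_iterator_tag iterator_category;
  typedef const char * value_type;
  typedef std::ptrdiff_t difference_type;
  typedef const char * const * pointer;
  typedef const char * const & reference;

  EmulationIterator() : e_(0), cur_(0) {}  // end iterator
  explicit EmulationIterator(StringEmulation * e)
    : e_(e), cur_(e->next()) {}

  reference operator*() const { return cur_; }
  EmulationIterator & operator++() { cur_ = e_->next(); return *this; }
  bool operator==(const EmulationIterator & o) const { return cur_ == o.cur_; }
  bool operator!=(const EmulationIterator & o) const { return cur_ != o.cur_; }
private:
  StringEmulation * e_;
  const char * cur_;
};
```

// Usage: for (EmulationIterator i(&emulation), end; i != end; ++i) ...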

// The strings returned generally remain valid as long as the underlying
// container remains valid.

class StringEmulation : public Object {
public:
  virtual const char * next() = 0;
  virtual bool at_end() const = 0;
};

struct StringPair {
  const char * first;
  const char * second;
};

class StringPairEmulation : public Object {
public:
  virtual StringPair next() = 0;
  virtual bool at_end() const = 0;
  virtual ~StringPairEmulation() {}
};

// Used by the Config class below...
class MutableContainer {
public:
  virtual void insert(const char *) = 0;
  virtual void remove(const char *) = 0;
  virtual void clear() = 0;
  virtual ~MutableContainer();
};

// NOTE: methods that return a bool generally return false on error and
// true otherwise. To find out what went wrong use the
// error_num and error_message methods.
// Unless otherwise stated, methods that return a const char * will
// return null on error. The character string returned is only valid
// until the next method which returns a const char * is called.

enum AddAction {Insert, Replace};

// A string map is a simple hash table where the key and values
// are strings. It also has the ability to write and read data
// files of a standard format.
// It is perfect for storing word pairs for "Replace All".
class StringMap : public Object {
public:
  StringPairEmulation * elements() const;
  // allocated with new

  virtual bool insert(const char * key, const char * value) = 0;
  // note: insert will NOT overwrite an existing entry
  virtual bool replace(const char * key, const char * value) = 0;

  virtual bool remove(const char * key) = 0;
  virtual const char * lookup(const char * key) const = 0;
  // the string returned is valid as long as the underlying value
  // does not change
  virtual bool have(const char * key) const = 0;
  virtual void clear() = 0;

  virtual bool merge(const StringMap & other);
  
  virtual bool read_in_stream(istream &, char delim = '\n',
                              AddAction a = Replace);
  virtual bool read_in_file(const char *, AddAction a = Replace);
  virtual bool read_in_string(const char *, AddAction a = Replace);
  
  virtual bool write_to_stream(ostream &) const;
  virtual bool write_to_file(const char *) const;
};
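
// A minimal sketch (not part of the original draft) of the insert versus
// replace semantics described above. SimpleStringMap is a hypothetical
// stand-in backed by std::map rather than a hash table. Returning false
// from insert when the key already exists is an assumption; the draft
// only says that insert will not overwrite.

```cpp
#include <map>
#include <string>
#include <utility>

class SimpleStringMap {
public:
  // Adds the pair only if the key is new; returns false otherwise.
  bool insert(const std::string & key, const std::string & value) {
    return map_.insert(std::make_pair(key, value)).second;
  }
  // Adds the pair, overwriting any existing value.
  bool replace(const std::string & key, const std::string & value) {
    map_[key] = value;
    return true;
  }
  bool remove(const std::string & key) { return map_.erase(key) == 1; }
  bool have(const std::string & key) const { return map_.count(key) == 1; }
  // Null pointer if the key is absent; the pointer stays valid only as
  // long as the underlying value does not change.
  const char * lookup(const std::string & key) const {
    std::map<std::string, std::string>::const_iterator i = map_.find(key);
    return i == map_.end() ? 0 : i->second.c_str();
  }
private:
  std::map<std::string, std::string> map_;
};
```

// For "Replace All", the misspelling would be the key and the chosen
// correction the value.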

struct KeyInfo {
  const char * name;
  enum Type {Bool, String, Int, List};
  Type type;
  const char * def;
  const char * desc; // null if internal value
};

class KeyInfoEmulation {
...
};

// The Config class is used to hold configuration information.
// It has a set of keys which it will accept. Inserting or even
// trying to look at a key that it does not know will produce
// an error. Extra accepted keys can be added with the set_extra
// method.
class Config : public StringMap {
public:
  void set_extra(const KeyInfo * begin, const KeyInfo * end);
  
  virtual const KeyInfo * keyinfo(const char * key) const = 0;
  virtual KeyInfoEmulation * possible_elements(bool include_extra = true) const = 0;
  // allocates with new
  
  virtual const char * get_default(const char * key) const = 0;

  // these, unlike lookup, will
  // a) return the default if the value is not set
  // b) give an error if the key is not a known key
  // c) give an error if the value is not in the right format

  virtual const char * retrieve (const char * key) const = 0;
  virtual const char * retrieve_list (const char * key) const = 0;

  virtual bool retrieve_list (const char * key, MutableContainer &) const = 0;

  virtual int retrieve_bool(const char * key) const = 0;
  // return -1 on error, 0 if false, 1 if true

  virtual int retrieve_int(const char * key) const = 0;
  // return -1 on error

  // This will read in the configuration from a set of files and
  // environmental variables specific to the particular spell checker
  // used.
  bool read_in();
};
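
// A minimal sketch (not part of the original draft) of the retrieve
// behaviour described above: known keys fall back to their default and
// unknown keys are an error. SimpleConfig is a hypothetical, much
// simplified class, and the "true"/"false" spelling of boolean values is
// an assumption.

```cpp
#include <map>
#include <string>

struct KeyInfo {
  const char * name;
  enum Type {Bool, String, Int, List};
  Type type;
  const char * def;
  const char * desc; // null if internal value
};

class SimpleConfig {
public:
  // The KeyInfo array must outlive the SimpleConfig object.
  SimpleConfig(const KeyInfo * begin, const KeyInfo * end) {
    for (const KeyInfo * k = begin; k != end; ++k) keys_[k->name] = k;
  }
  // Fails (returns false) for keys that are not known.
  bool replace(const std::string & key, const std::string & value) {
    if (keys_.count(key) == 0) return false;
    values_[key] = value;
    return true;
  }
  // Returns the set value, else the default; null for unknown keys.
  const char * retrieve(const std::string & key) const {
    std::map<std::string, const KeyInfo *>::const_iterator k = keys_.find(key);
    if (k == keys_.end()) return 0;
    std::map<std::string, std::string>::const_iterator v = values_.find(key);
    return v == values_.end() ? k->second->def : v->second.c_str();
  }
  // -1 on error, 0 if false, 1 if true, as in the draft.
  int retrieve_bool(const std::string & key) const {
    std::map<std::string, const KeyInfo *>::const_iterator k = keys_.find(key);
    if (k == keys_.end() || k->second->type != KeyInfo::Bool) return -1;
    const char * s = retrieve(key);
    if (s == 0) return -1;
    const std::string v(s);
    if (v == "true") return 1;
    if (v == "false") return 0;
    return -1;
  }
private:
  std::map<std::string, const KeyInfo *> keys_;
  std::map<std::string, std::string> values_;
};
```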

typedef unsigned short ShortUniChar;
typedef unsigned int UniChar;

// This class is responsible for keeping track of the dictionaries,
// coming up with suggestions, and the like.
// Its methods are NOT meant to be used by multiple threads and/or
// documents.
// Almost all of the manipulation of options is done via the Config
// class, thus this class has precious few methods.

class SuggestionList; // defined below

class Manager : public Object {
public:

  virtual Config & config() = 0;
  virtual const Config & config () const = 0;
  
  virtual const char * lang_name() const = 0;

  bool check(const char * word) const;
  bool check(const ShortUniChar * word) const;
  bool check(const UniChar * word) const;

  bool add_to_personal(const char *);
  bool add_to_session(const char *);

  bool add_to_personal(const ShortUniChar *);
  bool add_to_session(const ShortUniChar *);

  bool add_to_personal(const UniChar *);
  bool add_to_session(const UniChar *);
  
  bool save_all_wls();

  void clear_session();
  
  SuggestionList & suggest(const char * word);
  SuggestionList & suggest(const ShortUniChar * word);
  SuggestionList & suggest(const UniChar * word);
  // the suggestion list and the elements in it are only
  // valid until the next call to suggest.
  
  bool store_repl(const char * mis, const char * cor);
  bool store_repl(const ShortUniChar * mis, const ShortUniChar * cor);
  bool store_repl(const UniChar * mis, const UniChar * cor);

};

class SuggestionListEmulation {
public:
  virtual const char * next_string() = 0;
  virtual const ShortUniChar * next_short_uni_string() = 0;
  virtual const UniChar * next_uni_string() = 0;
  virtual bool at_end() const = 0;
  virtual ~SuggestionListEmulation() {}
};

class SuggestionList : public Object {
public:
  virtual bool empty() const = 0;
  virtual int size() const = 0;
  virtual SuggestionListEmulation * elements() const = 0;
};

// There will also be a bunch of functions that will return various
// of the above classes allocated with new.

// Stuff like sharing dictionaries between different Managers and the
// like will be handled by these functions and by setting parameters
// in the Config class.

// Should I provide classes to directly access the individual word lists?

// There will also be classes to spell check complete documents
// which will allow the spell checker to skip over TeX commands, HTML
// tags, etc...

  

Justin Bradford (justin@ukans.edu)
Thu, 15 Jul 1999 19:16:37 -0500 (CDT)

With some of the problems using ispell lately, I was wondering if anyone
had evaluated aspell as a replacement? I know it is supposed to be much better
than ispell at predicting replacements (roughly on par with Word, if not
a bit better).

Is it i18n and l10n problems?

Of course, I'm not even sure it solves the endian problems, but I know it
has several advantages, and was under the impression that ispell would be
"replaced" by aspell, eventually (as in no one continuing to work on
ispell).

Have I asked this before here? It seems like I remember asking this early
on...

Darren O. Benham (gecko@benham.net)
Thu, 15 Jul 1999 17:19:15 -0700

Aspell uses templates... templates are taboo...

Subject: various questions
Justin Bradford (justin@ukans.edu)
Sun, 19 Sep 1999 04:55:59 -0500 (CDT)

...

3. Spell checking

I have local modifications to ispell reintegrating its simple word
suggestion code. My plan was to make use of this in the dialog
and possibly in a right-click menu on squiggled words.
That and the dialog ought to hit the tree tomorrow.

Also, I've looked through aspell, and the trick to its good suggestions
is a combination of ispell's method and metaphones. It doesn't look too
hard to integrate, but the metaphone transformations will be language
dependent. We could probably rig up an external ruleset which would
allow localizers to implement a new language for spell check.

And, we should have a way to preserve "ignore all" words for at least
a session (possibly store in the file, too?), and the stripped ispell
should be expanded to handle user-defined dictionaries.

As these get done, a generic spelling interface should get built, which
could then be replaced by another spelling checker.

...

Subject: Spell checking (was Re: various questions)
Paul Rohr (paul@abisource.com)
Tue, 21 Sep 1999 18:46:48 -0700

At 04:55 AM 9/19/99 -0500, Justin Bradford wrote:
>3. Spell checking
>
> I have local modifications to ispell reintegrating its simple word

Cool. For some reason the guy who did the original ispell integration left
out that functionality, which never made sense to me. It'll be good to have
it back. :-)

> Also, I've looked through aspell, and the trick to its good suggestions

Sounds intriguing. Remember, though, that the longstanding objection to
aspell is its use of advanced C++ features like templates, which greatly
reduce portability.

We'd love to have a cooler engine than ispell, and aspell's results sure
look cool, but given all the platforms people want to run AbiWord on, the
portability problem is a biggie. I suspect that the problems of generating
and distributing aspell-format dictionaries for various languages pale in
comparison to this.

> And, we should have a way to preserve "ignore all" words for at least

For both personal dictionaries and ignore lists, we need to decide two
things:

1. how they're stored in memory, and
2. how they persist.

Since ignore lists tend to be small, actually using ispell to manage them
seems like rampant overkill. A trivial in-memory representation would be to
just store the words in a per-document UT_AlphaHashTable. Then if we need
to persist that information in the file format -- does Word do this, BTW? --
we could serialize that word list in a header section of the document.

Likewise, personal dictionaries also tend to be far, far smaller than ispell
dictionaries -- my idiolect is a *lot* smaller than the rest of the English
language :-) -- so a similar approach should work there too. In this case,
the UT_AlphaHashTable would be app-wide, and could easily persist to a
simple text file with one word per line. In fact, iterative calls to
UT_AlphaHashTable::getNthEntryAlpha() would even ensure that the resulting
file is alpha-sorted, which is pretty nice.

This should also make it quite easy to mimic W97's UI trick of editing
personal dictionaries by loading them in as plain-text documents. ;-)

> As these get done, a generic spelling interface should get built, which

Yep. The current API isn't a very nice hack, is it?

If we follow the alphahash approach suggested above, then the replacement
API could stay quite simple, since personal dictionary management and the
whole ignore concept would both be handled by the app, and not by the
particular dictionary lookup engine.

This should make switching from ispell dictionaries to your favorite *spell
engine much easier, because all that engine would need to do is check and
suggest individual words.
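
A minimal sketch of that split, with the engine reduced to check and suggest while the app owns the ignore list and personal dictionary. All of the names here (SpellEngine, AppSpeller, WordListEngine) are hypothetical and do not correspond to existing AbiWord classes.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical minimal engine interface: check and suggest, nothing else.
class SpellEngine {
public:
  virtual ~SpellEngine() {}
  virtual bool check(const std::string & word) const = 0;
  virtual std::vector<std::string> suggest(const std::string & word) const = 0;
};

// App-side layer that owns the ignore list and personal dictionary, so
// the engine underneath can be swapped without touching either.
class AppSpeller {
public:
  explicit AppSpeller(const SpellEngine & engine) : engine_(engine) {}
  void ignore_all(const std::string & w) { ignored_.insert(w); }
  void add_to_personal(const std::string & w) { personal_.insert(w); }
  bool check(const std::string & w) const {
    return ignored_.count(w) != 0 || personal_.count(w) != 0 ||
           engine_.check(w);
  }
  std::vector<std::string> suggest(const std::string & w) const {
    return engine_.suggest(w);
  }
private:
  const SpellEngine & engine_;
  std::set<std::string> ignored_, personal_;
};

// Toy engine for illustration: knows a fixed word list, suggests nothing.
class WordListEngine : public SpellEngine {
public:
  explicit WordListEngine(const std::set<std::string> & words)
    : words_(words) {}
  bool check(const std::string & w) const { return words_.count(w) != 0; }
  std::vector<std::string> suggest(const std::string &) const {
    return std::vector<std::string>();
  }
private:
  std::set<std::string> words_;
};
```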

Justin Bradford (justin@ukans.edu)
Wed, 22 Sep 1999 16:09:28 -0500 (CDT)

> We'd love to have a cooler engine than ispell, and aspell's results sure

I was just considering adding the algorithm (not the actual code) to the
ispell base. It's language dependent (metaphone mapping), but something we
could make a configuration option.

Also, it would be easy to add glue for aspell (and other similar
libraries) which could be an optional build item. eg. if the aspell
library is present, compile with that rather than our modified ispell.
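
A rough sketch of what such an optional build item could look like at the source level. HAVE_ASPELL is a hypothetical macro that a configure script or makefile would define, and the glue function names are made up; no real aspell or ispell calls are shown.

```cpp
#include <string>

#ifdef HAVE_ASPELL
// Declared here, defined in the (hypothetical) aspell glue file.
bool aspell_glue_check(const std::string & word);
inline const char * backend_name() { return "aspell"; }
inline bool backend_check(const std::string & w) {
  return aspell_glue_check(w);
}
#else
// Stand-in for the bundled, modified ispell; it accepts every word so
// that the sketch builds on its own.
inline bool ispell_glue_check(const std::string &) { return true; }
inline const char * backend_name() { return "ispell"; }
inline bool backend_check(const std::string & w) {
  return ispell_glue_check(w);
}
#endif
```

The rest of the application would call only backend_check and backend_name, so the choice of engine stays a build-time detail.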

Anyway, I've been stalled by some other work for a bit, but the dialog
should be done soon.

> Since ignore lists tend to be small, actually using ispell to manage them

Yeah, that's where I was planning to put them someday. As for storing
them in file, I am not aware of Word doing that. It was just an idea; I
assumed if someone said to ignore all instances of a word, they'd like it
to not be squiggled when they reload the doc. Perhaps it would be better
(UI-wise, anyway) to have them click "add to document's dictionary"
rather than "ignore all" before it's preserved across sessions.

> Likewise, personal dictionaries also tend to be far, far smaller than ispell

Ok, that should work. However, I think some spell checking systems handle
the user-dictionary side of things, too. It might be nice to handle that,
but I say we revisit the problem if it ever comes up.

Shaw Terwilliger (sterwill@postman.sourcegear.com)
Wed, 22 Sep 1999 16:14:43 -0500

> I was just considering adding the algorithm (not the actual code) to the

Please, unless you have some unholy love of ispell, avoid ispell entirely.
:) I realize adding code to ispell means people world-wide automatically
have the new algorithm when they upgrade their computers, but the reason
we'd like to avoid ispell doesn't have to do with the algorithm it
uses. It's more that its code is just that icky, its build system is
older than I am, and we still inherit all those endianness/byte order/struct
problems from its dictionaries.

Justin Bradford (justin@ukans.edu)

> Please, unless you have some unholy love of ispell, avoid ispell entirely.

I agree; ispell is a mess. Lots of globals and static arrays, too.

A new and improved, well designed spell check library would be good. I
wasn't going to start rewriting the ispell base we have now right away, so
we can consider alternatives. In fact, someone emailed me earlier about
some code they have. When I get to it, I might start by converting his
code to C/C++ (assuming he doesn't beat me to it).

Subject: ispell alternatives
Paul Rohr (paul@abisource.com)
Mon, 22 Nov 1999 11:01:07 -0800

I totally agree with Justin here. We need to ship a single, standard,
cross-platform solution for both spell-checking code *and* dictionaries.

The ispell code we're using for this is *definitely* no thing of beauty --
far from it -- but it does meet those goals. Note that the technical
issues here (replacing the ispell code with something nicer) are much easier
to address than the practical/legal issues (replacing the ispell
dictionaries).

Remember that good dictionaries, like good software, take a *lot* of
specialized work to construct, and they also fall under copyright laws.

One of the main advantages of ispell-format dictionaries is that they
already exist in distributable form for a large number of the languages
AbiWord users might want, and the tools for creating more such dictionaries
are in reasonably wide use:

http://fmg-www.cs.ucla.edu/geoff/ispell.html

In general, obtaining high-quality, freely-distributable dictionaries is
hard, primarily due to a lack of tools and copyright restrictions. (For
example, typing in the word list from your favorite paper dictionary is
probably a copyright violation. Depending on your system, so is adding the
contents of /usr/dict/words to your ispell dictionary, which may only be
legal if you don't distribute the results. Depending on the copyrights for
the ispell-format word lists, it may also be illegal to modify them by
converting them to another dictionary format. See the ispell docs for more
info on this problem.)

That having been said, of course there's no objection to coming up with
interfaces for wrappers to additional spell checkers in a post-1.0 time
frame. That hurdle should be fairly low.

To be clear though, I want to set accurate expectations for anyone who hopes
to replace ispell entirely as the main spell engine for AbiWord (and other
subsequent AbiSuite apps). Because of the dictionary problem, that bar is
*significantly* higher.

PS: Anyone interested in replacing ispell should consider spending some time
trolling around the ispell distribution, if you haven't already. If nothing
else, be sure to read the Contributors file. In it you'll find an
abbreviated history of 40 (!) years work on spell checkers with all sorts of
fascinating obscure details. For example, approximately 270 (!) people have
contributed to this program, which traces its roots back almost 30 (!)
years.

Tony Merc Mobily (merc@squiz.net)
Tue, 23 Nov 1999 03:16:15 +0000 (/etc/localtime)

> The ispell code we're using for this is *definitely* no thing of beauty --
> far from it -- but it does meet those goals. Note that the technical

I would like to clarify that I was not addressing the problem of the
quality of ispell itself in my email.
What I was saying was: my operating system, Unix, comes with a spell
checker and some dictionaries.

In an ideal world, *all* the programs use that spelling program, and *all*
the programs share the same main dictionary/dictionaries.

The first goal is reached by not linking Abiword with anything, but having
Abiword *using* ispell (see: running it, even only once and then piping
things in). Please remember that the command that corresponds to "ispell"
might be an ispell clone, that keeps the same interface.

The second goal ("all programs share the same dictionary") is achieved
automatically using ispell as a spell checker, and piping into it.
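
For the pipe approach, the application would send words to "ispell -a" and read back one result line per word. Below is a minimal sketch of parsing those result lines. Starting the ispell process itself is platform specific and is not shown, and IspellResult and parse_ispell_line are hypothetical names. The line prefixes handled are the ones described in the ispell manual for -a mode.

```cpp
#include <sstream>
#include <string>
#include <vector>

struct IspellResult {
  bool correct;
  std::vector<std::string> suggestions;
};

// Parses one result line from "ispell -a":
//   "*", "+ ROOT", "-"                 word accepted
//   "& word count offset: a, b, c"     misspelled, with near misses
//   "? word 0 offset: a, b"            misspelled, with guesses
//   "# word offset"                    misspelled, nothing to suggest
// The blank separator lines and the "@(#)" banner should be skipped by
// the caller; they come back here as not correct with no suggestions.
inline IspellResult parse_ispell_line(const std::string & line) {
  IspellResult r;
  r.correct = false;
  if (line.empty()) return r;
  const char c = line[0];
  if (c == '*' || c == '+' || c == '-') {
    r.correct = true;
    return r;
  }
  if (c == '&' || c == '?') {
    const std::string::size_type colon = line.find(": ");
    if (colon == std::string::npos) return r;
    std::istringstream in(line.substr(colon + 2));
    std::string word;
    while (std::getline(in, word, ',')) {
      if (!word.empty() && word[0] == ' ') word.erase(0, 1);
      if (!word.empty()) r.suggestions.push_back(word);
    }
  }
  return r;
}
```

Any program that keeps the same line format, such as an ispell clone installed under the "ispell" name, could be parsed the same way.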

> Remember that good dictionaries, like good software, take a *lot* of
> specialized work to construct, and they also fall under copyright laws.

This is *exactly* the reason why you might want to try to help the
existing dictionaries grow, instead of trying to have an "Abiword's" one.

> One of the main advantages of ispell-format dictionaries is that they

Yep, but that doesn't fit with the "bigger picture", where the *whole* lot
of applications on your computer benefit from a good dictionary.

> In general, obtaining high-quality, freely-distributable dictionaries is

This is undoubtedly true.

> That having been said, of course there's no objection to coming up with

I wasn't talking about "additional spell checkers", but a different
approach to the spell checking problem that would take into consideration
the "big picture" out there.

> To be clear though, I want to set accurate expectations for anyone who

I didn't quite understand what the link between the dictionaries and the
way you interface to the spell checker is.

> PS: Anyone interested in replacing ispell should consider spending some time

Again, I wasn't proposing to get rid of ispell, or whatever. I was just
suggesting a way of having *one* spell checker and *one* dictionary on the
system, instead of having *five* spell checkers and *five* dictionaries for
*five* different word processors. Offering, at the same time, the
opportunity to use a different system spell checker.

That's it.

Paul Rohr (paul@abisource.com)
Tue Nov 30 1999 - 13:36:41 CST

Merc,

Sounds like we're in violent agreement here. Figuring out good ways to take
advantage of existing ispell dictionaries is a Good Thing for users, even
though it gives developers fits.

That's why we chose to clean up and integrate portions of the ispell code
into AbiWord, so that we could leverage those dictionaries across all our
supported platforms. The fact that you can have interactive spell-checking
(squiggles, dictionaries, and popup corrections) running in real time off
those dictionaries is quite an achievement.

Even more so, because that code, and those dictionaries, are *brittle*.
Darren Benham, Justin, and others have done battle with some ferocious
problems (leaks, endian-ness, multiple languages, etc.) to get to where we
are today. Even so, nobody's managed yet to figure out how to properly
translate accented characters (from Unicode to whatever ispell expects) so
that lookups will work in French, etc. For more details, see the relevant
POW from last month:

  http://www.abisource.com/mailinglists/abiword-dev/99/October/0325.html
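
As an illustration of the simplest case only: the first 256 Unicode code points coincide with ISO-8859-1, so a word made of those characters can be narrowed byte for byte. Whether a particular ispell dictionary expects Latin-1 or some other encoding depends on that dictionary, so treat the following as a sketch under that assumption; ucs2_to_latin1 is a hypothetical helper.

```cpp
#include <string>

typedef unsigned short ShortUniChar;

// Converts a zero-terminated UCS-2 string to ISO-8859-1. Returns false
// if any character falls outside Latin-1; in that case the contents of
// out are only the characters converted so far.
inline bool ucs2_to_latin1(const ShortUniChar * in, std::string & out) {
  out.clear();
  for (; *in != 0; ++in) {
    if (*in > 0xFF) return false;
    out += static_cast<char>(static_cast<unsigned char>(*in));
  }
  return true;
}
```

Words that fail the conversion would need a different mapping or would have to be skipped by the checker.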

I know you're interested in solving a larger issue -- providing spell
services to lots of applications on your platform of choice -- and we'd love
it if you or someone else could solve that problem for us. :-)

In the mean time, we'll continue to work to make sure we have a single
solution which Just Works on all our supported platforms. We'd also like to
make sure that solution isn't too closely tied to AbiWord, so that we can
reuse it for other apps in AbiSuite. We've got a ways to go on this.

At the moment, our spelling APIs are more entwined with our app-specific
logic than we'd like, but at least it works. As folks have time, we'd like
to do a better job of isolating dictionary lookup services from the rest of
the program behind a clean, small API. Once this is done, it should be much
easier to swap out the hardwired ispell-based support for other
alternatives, including:

  - piping out to ispell, where applicable
  - hooking into other platform-specific services, as available, or
  - some other XP solution which is cleaner or better.

Note that we've already been implementing certain peripheral spell-related
services (such as ignore lists, custom dictionaries, etc.) outside of the
ispell codebase to make it easier to change dictionary engines later on as
they become available.

bottom line
-----------
As always, the best way to change our minds about design decisions like this
is to show us the source for another approach which works better than what
we've currently got.

Endless discussions over email can get very frustrating without necessarily
resolving anything. Checking in the right patch is a *much* more satisfying
resolution to any debate. :-)

The rest of this thread talks about the endian problem and can be found at
http://www.abisource.com/mailinglists/abiword-dev/99/December/0003.html.

Subject: Re: Spellchecking - [better aida subject;-)]
Paul Rohr (paul@abisource.com)
Tue Feb 15 2000 - 15:52:58 CST

>What is the plan for spellchecking at the moment?

There are two main virtues of the existing ispell code:

1. There are a *lot* of existing ispell dictionaries we can use, covering
most of the languages we already have translations for, plus almost a dozen
we don't. This is a very big deal.

2. Thanks to a lot of hard work by Darren, Justin, and Henrik, the code
usually works well enough on all of our supported platforms.

Other than that, nobody who's had to wrestle with that codebase is in love
with it. Also, the current APIs between the ispell code and the rest of
AbiWord are no thing of beauty. Their main virtue is that they work for so
many languages.

From time to time, advocates of other spelling codebases have popped up, and
our usual response is as follows:

1. Whatever you do, don't break ispell.

2. Otherwise, feel free to propose new APIs to the spell code which would
allow you to swap in another spell engine.

3. Make sure you have good dictionary support for other languages. If you
think writing spell lookup code is bad, getting freely-licensed dictionary
content is much, much worse.

4. Try to *add* interesting functionality, rather than making changes for
change's sake. For example, the quality of aspell's suggestions makes it a
tempting target.

5. Don't violate the coding guidelines found in abi/docs. For example,
every time we've looked at aspell, it gets immediately shot down for its
heavy reliance on advanced C++ features like templates.


