Re: localization formats proposal


Subject: Re: localization formats proposal
From: Kevin Atkinson (kevina@users.sourceforge.net)
Date: Wed Jul 18 2001 - 18:15:05 CDT


On 18 Jul 2001, Ron Ross wrote:

> I went searching for Canadian ispell hash files recently. I didn't find
> any, but I did find a word list package by Kevin Atkinson:
>
> "VarCon (Variant Conversion Info) contains tables to convert between
> American, British, and Canadian spellings and vocabulary as well as a
> table listing the equivalent forms of other variants"
>
> which can be found at http://wordlist.sourceforge.net/. It seems to be
> made for Aspell. I haven't actually done anything with this package yet
> (can the lists be converted to hash files?), but looking through the
> lists, it seems pretty reliable.

The list you want to use is SCOWL. It provides much better word lists
than the ones provided by Ispell. It offers them in a large variety of
sizes and in three varieties: American, British, and Canadian. Aspell uses
size 65 (plus a few special word lists), which I found to be about
right, as it includes almost all words the average person will use (plus a
couple of added hacker terms). I have been trying to get AbiWord to use them
for their Ispell hash files, but it seems like no one cares.

All that is needed is for someone to create the hash files from the word
lists. I will gladly send the word list that Aspell uses; however, it is
fairly easy to create it from SCOWL. In fact, here is how to create the
lists (with bash as the shell):

cd final/
cat english-*.{10,20,35,50,60,65} special-* > english.wl
cat english.wl american-*.{10,20,35,50,60,65} > american.wl
cat english.wl british-*.{10,20,35,50,60,65} > british.wl
cat english.wl canadian-*.{10,20,35,50,60,65} > canadian.wl
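
The commands above depend on bash brace expansion. A rough sketch of the
same concatenation for a plain POSIX sh follows. The names sizes, collect,
and build_lists are made up for illustration and are not part of SCOWL,
and the sketch assumes it is run from SCOWL's final/ directory:

```shell
# Sketch only: same concatenation as above, without bash brace expansion.
# Assumes the current directory is SCOWL's final/ directory.
sizes="10 20 35 50 60 65"

# collect PREFIX: print every PREFIX-*.SIZE file, smallest size first.
collect() {
    for size in $sizes; do
        cat "$1"-*."$size"
    done
}

# build_lists: write english.wl plus one combined list per variant.
build_lists() {
    { collect english; cat special-*; } > english.wl
    for variant in american british canadian; do
        { cat english.wl; collect "$variant"; } > "$variant.wl"
    done
}
```

Calling build_lists in final/ should leave english.wl, american.wl,
british.wl, and canadian.wl there, with the files read in the same order
as in the bash version.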

However, those word lists contain accented characters in ISO 8859-1
encoding, which most people find annoying. To make the word lists without
them:

cd src
make deaccent
cd ..
cd final/
cat english-*.{10,20,35,50,60,65} special-* | ../src/deaccent > english.wl
cat english.wl american-*.{10,20,35,50,60,65} | ../src/deaccent > american.wl
cat english.wl british-*.{10,20,35,50,60,65} | ../src/deaccent > british.wl
cat english.wl canadian-*.{10,20,35,50,60,65} | ../src/deaccent > canadian.wl
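
One way to sanity-check the result, assuming the goal is a plain ASCII
list, is to count the bytes outside the ASCII range that remain in a file.
This is only a sketch; the function name count_non_ascii_bytes is invented
here and is not part of SCOWL:

```shell
# count_non_ascii_bytes FILE: print how many bytes in FILE are >= 0x80.
# LC_ALL=C makes tr work on single bytes; the octal range \000-\177
# covers all of ASCII, so only the high bytes survive to be counted.
count_non_ascii_bytes() {
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}
```

For example, "count_non_ascii_bytes american.wl" prints 0 when the file
is pure ASCII. A nonzero count means some ISO 8859-1 characters are still
present.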

---
Kevin Atkinson
kevina at users sourceforge net
http://www.ibiblio.org/kevina/


