Re: Barbarism implementation proposal

From: Andrew Dunbar (hippietrail@yahoo.com)
Date: Wed Sep 25 2002 - 00:33:37 EDT

Next message: mike: "Re: spell-checking non-functional in cvs"

Previous message: Andrew Dunbar: "Re: internationalization: I can't get it to work"
In reply to: Jordi Mas: "Barbarism implementation proposal"
Next in thread: Dom Lachowicz: "Re: Barbarism implementation proposal"
Reply: Dom Lachowicz: "Re: Barbarism implementation proposal"
Reply: Alan Horkan: "Re: Barbarism implementation proposal"

--- Jordi Mas <jmas@softcatala.org> wrote:
> Hello guys,

Hi Jordi.

> After some discussion in the list and some IRC
> talking with Dom and others, I have put together an
> implementation proposal for the barbarism detection
> feature.
>
> * What is a barbarism
>
> Barbarism is a problem that mainly concerns
> minority languages, i.e. languages that are
> competing, in the same territory, with a more
> powerful one, called the "roof language", for example
> Welsh, Catalan, Occitan, and others.
>
> When two languages compete in the same territory,
> interferences arise, but they are not symmetric.
> The roof language is weakly affected but the
> minority one can be strongly affected, and can
> disappear (glottophagy). One of these
> interferences is barbarism.
>
> * How we implement it
>
> We have a class called Barbarism that lives in
> 'src\other\spell\xp\'.

We should move the proofing features to a new directory
that makes more sense. Spelling, grammar, and
barbarisms can be arranged under it in whatever way
works best. Should it go under XAP or AP?
Spellchecking seems more cross-platform than the other
parts...?

> We init the class when the ispellclass is created
> and when we do CheckWord and suggestWord we also
> call the Barbarism class.
>
> * How we store them
>
> The file that contains the barbarisms is an XML file
> that lives in the same directory as the
> dictionaries and it has the same name as the
> dictionary file but with a barbarism extension.
>
> For example, for American it will be
> "american.barbarisms"

This is a bad idea. Only ispell dictionaries have
these names and I for one hate the ispell naming
system. Aspell probably uses different names and
lives in different directories. Note that many
distros have ispell hashes in places other than where
AbiWord probably prefers them. I'm doing some work
now to make AbiWord use any ispell hash files it can
find. Hopefully I can even make it use a mix of
ispell and aspell (and maybe even myspell)
dictionaries.

Making the barbarism files depend just on the old
ispell stuff is ugly - the names have to live in
tables and are not easily extended by non-programmers.
I'm hoping to improve this. Please give the barbarism
files logical names based on language tags such as
"ca.barbarisms" or "ca-ES.barbarisms".

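To make the naming idea concrete, here is a rough sketch
in Python (not AbiWord code) of how a language tag could
be turned into a list of file names to try, most specific
first. The function name and the fallback from "ca-ES" to
"ca" are my own assumptions, not something already agreed.

```python
def barbarism_file_candidates(lang_tag, extension=".barbarisms"):
    """Return barbarism file names to try, most specific first.

    Hypothetical helper: the fallback order is an assumption.
    """
    # Accept both "ca-ES" and "ca_ES" style tags.
    parts = lang_tag.replace("_", "-").split("-")
    candidates = []
    while parts:
        candidates.append("-".join(parts) + extension)
        parts.pop()  # drop the most specific subtag and try again
    return candidates


print(barbarism_file_candidates("ca-ES"))
```

A non-programmer could then add a new language just by
dropping in a correctly named file, with no table to edit.
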
> This is an example file. The attribute "word"
> contains the wrong word, the attribute suggestion
> contains the right word to use
>
> <?xml version="1.0" encoding="utf-8"?>

In XML, leaving out the encoding declaration causes the
parser to default to UTF-8 (unless a byte order mark
says UTF-16).

> <AbiBarbarism app="AbiWord" ver="1.0"
> language="ca-ES">
>
> <Barbarism
> word="boleto"
> suggestion1="billet"
> />
>
> <Barbarism
> word="tiro"
> suggestion1="tret"
> />
>
> <Barbarism
> word="tanteig"
> suggestion1="tempteig"
> />
>
>
> <Barbarism
> word="tamany"
> suggestion1="mida"
> suggestion2="grandària"
> />
>
> </AbiBarbarism>

That looks nice. I'm not sure that "suggestion1" and
"suggestion2" are the best XML solution. Wouldn't
lists be better done with actual tags? I'm not an XML
expert - can an XML expert give an opinion please?
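
To show what I mean by actual tags, here is a small
sketch in Python rather than AbiWord's own code. The
<Suggestion> child element is a hypothetical alternative
to the numbered attributes, and load_barbarisms is an
illustrative name only.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: one <Suggestion> child per suggested word.
SAMPLE = """<AbiBarbarism app="AbiWord" ver="1.0" language="ca-ES">
  <Barbarism word="tiro">
    <Suggestion>tret</Suggestion>
  </Barbarism>
  <Barbarism word="tamany">
    <Suggestion>mida</Suggestion>
    <Suggestion>grandària</Suggestion>
  </Barbarism>
</AbiBarbarism>"""


def load_barbarisms(xml_text):
    """Map each wrong word to its list of suggestions, in file order."""
    root = ET.fromstring(xml_text)
    table = {}
    for entry in root.findall("Barbarism"):
        table[entry.get("word")] = [s.text for s in entry.findall("Suggestion")]
    return table


print(load_barbarisms(SAMPLE))
```

With this layout the reader does not need to know in
advance how many suggestions an entry can have.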

> * Known problems in the design
>
> - We work at word level, not sentence level. We are
> just hacking a spell checker

I think this is the correct way to do it.

> - Words that can be inflected have to be coded
> several times (plurals, verb conjugations, etc). At
> least in Catalan, this is not very common.

Spelling hashes do this anyway. Agglutinative
languages may be more painful but that's already the
case for spelling hashes so we probably don't need to
worry. If there are some Finnish or Hungarian
speakers here, do you have any ideas on this?

> OK, that's basically it. I would love to hear your
> comments to see how we can define this better than
> it is right now.

To me it's not so important whether it's part of
spelling, part of grammar, part of style, or its own
separate thing; but all these need to be moved under a
general proofing concept where squiggles and other
common things can be grouped.
Also, whether it's part of spelling or part of grammar,
I still want to be able to enable it separately.

You know, this could be a better way to solve the
problems of English spelling varieties. British,
Canadian, Australian, Irish, and US spellings all
differ slightly and currently each one just decides
whether to use the British or American hash.
Nobody has so far wanted to build a special Australian
or Canadian hash from scratch but it would be a lot
less work to build barbarism files for them since they
only need the exceptions to be listed.

In this case an extra XML field "comment" or "reason"
or something better could contain something useful
such as "Americanism" to provide more info in the
dialog.
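
As a rough illustration of the exceptions idea, here is
a Python sketch, not AbiWord code. BASE_WORDS is a toy
stand-in for a spelling hash, EN_CA_EXCEPTIONS stands in
for a hypothetical Canadian barbarism file with a
"reason" field, and the barbarisms_enabled switch is my
assumption about how it could be turned on separately.

```python
# Toy stand-in for a shared spelling hash (hypothetical contents).
BASE_WORDS = {"color", "colour", "center", "centre"}

# Stand-in for a per-variety barbarism file: only the exceptions are listed.
EN_CA_EXCEPTIONS = {
    "color": {"suggestions": ["colour"], "reason": "Americanism"},
    "center": {"suggestions": ["centre"], "reason": "Americanism"},
}


def check_word(word, base_words, exceptions, barbarisms_enabled=True):
    """Return (status, suggestions, reason) for a single word."""
    key = word.lower()
    # Barbarisms are checked first and can be switched off independently.
    if barbarisms_enabled and key in exceptions:
        entry = exceptions[key]
        return "barbarism", entry["suggestions"], entry["reason"]
    if key in base_words:
        return "ok", [], None
    return "misspelled", [], None


print(check_word("color", BASE_WORDS, EN_CA_EXCEPTIONS))
print(check_word("colour", BASE_WORDS, EN_CA_EXCEPTIONS))
```

The reason string is what the dialog would show next to
the suggestions.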

Alan, what do you think about this idea?

Andrew Dunbar.

> Thanks,
>
> --
>
> Jordi Mas
> http://www.softcatala.org
>
>
>
>

=====
http://linguaphile.sourceforge.net/cgi-bin/translator.pl http://www.abisource.com


This archive was generated by hypermail 2.1.4 : Wed Sep 25 2002 - 00:39:13 EDT