Re: Barbarism implementation proposal

From: Andrew Dunbar (hippietrail@yahoo.com)
Date: Wed Sep 25 2002 - 00:33:37 EDT

     --- Jordi Mas <jmas@softcatala.org> wrote:
    > Hello guys,

    Hi Jordi.

    > After some discussion in the list and some IRC
    > talking with Dom and others, I have put together an
    > implementation proposal for the barbarism detection
    > feature.
    >
    > * What is a barbarism
    >
    > Barbarism is a problem that mainly concerns
    > minority languages, i.e. languages that compete,
    > in the same territory, with a more powerful one,
    > called the "roof language". Examples are Welsh,
    > Catalan, Occitan, and others.
    >
    > When two languages compete in the same territory,
    > interferences arise, but they are not symmetric.
    > The roof language is weakly affected, but the
    > minority one can be strongly affected and can even
    > disappear (glottophagy). One of these
    > interferences is barbarism.
    >
    > * How we implement it
    >
    > We have a class called Barbarism that lives in
    > 'src\other\spell\xp\'.

    We should move the proofing features to a new directory
    that makes more sense; spelling, grammar, and
    barbarisms can be arranged under it in whatever way
    works best. Should it go under XAP or AP?
    Spellchecking seems more cross-platform than the other
    parts...?

    > We init the class when the ispell class is created,
    > and when we do CheckWord and suggestWord we also
    > call the Barbarism class.
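
    As a rough sketch of that hook (in Python for brevity rather
    than AbiWord's C++; every class and method name below is
    hypothetical and not taken from the actual code), the
    barbarism table can be consulted before the normal
    dictionary lookup:

```python
# Illustrative only: names are assumptions, not AbiWord's real API.
class Barbarism:
    """Maps a barbarism to the list of preferred replacements."""

    def __init__(self, entries):
        # entries: dict of wrong word -> list of suggested words
        self._entries = {word: list(sugg) for word, sugg in entries.items()}

    def check_word(self, word):
        """Return True if the word is a known barbarism."""
        return word in self._entries

    def suggest_word(self, word):
        """Return the replacements for a barbarism, or an empty list."""
        return list(self._entries.get(word, []))


class SpellChecker:
    """Stand-in for the ispell checker class; the dictionary is a plain set."""

    def __init__(self, dictionary, barbarism):
        self._dictionary = set(dictionary)
        self._barbarism = barbarism

    def check_word(self, word):
        """Return True only if the word is spelled correctly and not a barbarism."""
        if self._barbarism.check_word(word):
            return False
        return word in self._dictionary

    def suggest_word(self, word):
        """Barbarism replacements come first; a real checker would then
        append its usual near-miss suggestions."""
        return self._barbarism.suggest_word(word)
```

    The point of the sketch is only the ordering: a word found in
    the barbarism table is flagged even when the spelling
    dictionary accepts it.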
    >
    > * How we store them
    >
    > The file that contains the barbarisms is an XML file
    > that lives in the same directory as the
    > dictionaries, and it has the same name as the
    > dictionary file but with a "barbarisms" extension.
    >
    > For example, for American it will be
    > "american.barbarisms"

    This is a bad idea. Only ispell dictionaries have
    these names, and I for one hate the ispell naming
    system. Aspell probably uses different names and
    lives in different directories. Note that many
    distros keep ispell hashes in places other than where
    AbiWord probably prefers them. I'm doing some work
    now to make AbiWord use any ispell hash files it can
    find. Hopefully I can even make it use a mix of
    ispell and aspell (and maybe even myspell)
    dictionaries.

    Making the barbarism files depend on just the old
    ispell stuff is ugly - the names have to live in
    tables and are not easily extended by
    non-programmers. I'm hoping to improve this. Please
    give the barbarism files logical names based on
    language tags such as "ca.barbarisms" or
    "ca-ES.barbarisms".

    > This is an example file. The attribute "word"
    > contains the wrong word, the attribute suggestion
    > contains the right word to use
    >
    > <?xml version="1.0" encoding="utf-8"?>

    In XML, leaving out the encoding declaration makes the
    parser assume UTF-8 (or UTF-16 if the file starts with
    a byte order mark), so it is optional here.

    > <AbiBarbarism app="AbiWord" ver="1.0"
    > language="ca-ES">
    >
    > <Barbarism
    > word="boleto"
    > suggestion1="billet"
    > />
    >
    > <Barbarism
    > word="tiro"
    > suggestion1="tret"
    > />
    >
    > <Barbarism
    > word="tanteig"
    > suggestion1="tempteig"
    > />
    >
    >
    > <Barbarism
    > word="tamany"
    > suggestion1="mida"
    > suggestion2="grandària"
    > />
    >
    > </AbiBarbarism>

    That looks nice. I'm not sure if "suggestion1" and
    "suggestion2" is the best XML solution. Shouldn't
    lists be done with actual tags, i.e. child elements?
    I'm not an XML expert - can an XML expert give an
    opinion please?
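
    To make the question concrete, here is one possible shape
    with a child element per suggestion, plus a small Python
    reader for it. The "suggestion" element name and the
    load_barbarisms function are hypothetical; this is a sketch
    of the alternative, not a proposal for the final format:

```python
import xml.etree.ElementTree as ET

# Same data as two of the entries above, with suggestions as child elements.
SAMPLE = """<?xml version="1.0"?>
<AbiBarbarism app="AbiWord" ver="1.0" language="ca-ES">
  <Barbarism word="tamany">
    <suggestion>mida</suggestion>
    <suggestion>grandària</suggestion>
  </Barbarism>
  <Barbarism word="tiro">
    <suggestion>tret</suggestion>
  </Barbarism>
</AbiBarbarism>
"""


def load_barbarisms(xml_text):
    """Parse the document into a dict of wrong word -> list of suggestions."""
    # Encode to bytes so the parser applies the XML default of UTF-8.
    root = ET.fromstring(xml_text.encode("utf-8"))
    table = {}
    for node in root.findall("Barbarism"):
        table[node.get("word")] = [s.text for s in node.findall("suggestion")]
    return table
```

    With child elements the reader does not need to probe for
    "suggestion1", "suggestion2", and so on, and the number of
    suggestions is not limited by the attribute names.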

    > * Known problems in the design
    >
    > - We work at word level, not sentence level. We are
    > just hacking a spell checker

    I think this is the correct way to do it.

    > - Words that can be inflected have to be coded
    > several times (plurals, verb conjugations, etc.). At
    > least in Catalan, this is not very common.

    Spelling hashes do this anyway. Agglutinative
    languages may be more painful but that's already the
    case for spelling hashes so we probably don't need to
    worry. If there are some Finnish or Hungarian
    speakers here do you have any ideas on this?

    > OK, that's basically it. I would love to hear your
    > comments to see how we can define this better than
    > it is right now.

    To me it's not so important whether it's part of
    spelling, part of grammar, part of style, or its own
    separate thing; but all these need to be moved under a
    general proofing concept where squiggles and other
    common things can be grouped.
    Also, whether it's part of spelling or part of grammar,
    I still want to be able to enable it separately.

    You know, this could be a better way to solve the
    problems of English spelling varieties. British,
    Canadian, Australian, Irish, and US spellings all
    differ slightly, and currently each locale just picks
    either the British or the American hash.
    Nobody has so far wanted to build a special Australian
    or Canadian hash from scratch, but it would be a lot
    less work to build barbarism files for them since
    only the exceptions need to be listed.

    In this case an extra XML field "comment" or "reason"
    or something better could contain something useful
    such as "Americanism" to provide more info in the
    dialog.
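
    A small sketch of the idea in Python. The Entry type, the
    check function, and the Canadian entries layered over a
    British word list are all illustrative assumptions, not real
    data files; an "Americanism" reason would work the same way:

```python
from dataclasses import dataclass


@dataclass
class Entry:
    suggestions: list
    reason: str = ""  # hypothetical extra field shown in the dialog


# Illustrative exception list for Canadian English over a British hash.
EN_CA = {
    "tyre": Entry(["tire"], "Britishism"),
    "kerb": Entry(["curb"], "Britishism"),
}


def check(word, base_words, exceptions):
    """Return (ok, suggestions, reason) for a word.

    The exception list is consulted first, so a word accepted by
    the base hash can still be flagged for this variety.
    """
    entry = exceptions.get(word)
    if entry is not None:
        return False, list(entry.suggestions), entry.reason
    return word in base_words, [], ""
```

    Here check("tyre", british_words, EN_CA) would flag the word
    and return the reason alongside the suggestion, while any
    word not listed falls through to the base hash unchanged.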

    Alan, what do you think about this idea?

    Andrew Dunbar.

    > Thanks,
    >
    > --
    >
    > Jordi Mas
    > http://www.softcatala.org
    >
    >
    >
    >

    =====
    http://linguaphile.sourceforge.net/cgi-bin/translator.pl http://www.abisource.com
