Re: Implementing support for barbarisms correction

From: Jordi Mas (jmas@softcatala.org)
Date: Sat Sep 21 2002 - 13:16:42 EDT


    Hello Dom,

    > One of my questions is "should these words be marked as misspelled at
    > all?" Words that are "incorrect" but are widely in use sound like good
    > candidates for addition into a language, regardless of what the French
    > and Spanish purists/governments think. Languages are fluid, evolving
    > things and borrow heavily from their times, surroundings, other
    > languages and cultures, technology/science, and the people who speak the
    > language. But let's say that these words should really be marked as
    > incorrect, for the sake of argument.

    Well, I get your point. Neologisms are either old words used with a new
    meaning or words taken from other languages. They are OK with me. In
    computing they are very common, and since English-speaking culture is the
    one producing the technology, we should use its neologisms in most cases.

    On the other hand, barbarisms are different. They are just wrong words. If
    your language already has a word to express a concept and you use an
    incorrect one instead, that is a barbarism.

    > If they really aren't allowable words, does this fall under a spell
    > checking problem? I'd argue no. Spell checking solves the problem of
    > mapping:
    >
    > possibly misspelled word->correctly spelled word(s)
    >
    > and not
    >
    > possibly suboptimal/illegal word->better/legal word(s)

    Well, the most popular commercial Catalan spell checker (WordCorrect) has a
    barbarism correction feature; that's why I thought it would be cool to have
    it in Abi as well.

    I personally believe that spell checkers were originally designed to fix
    misspelled words, as you say, but over time they have become a tool that
    helps people make fewer mistakes when they write. That is the reason why
    Word, for example, includes some basic grammatical correction.

    > In my opinion, this looks like a different, but related problem, one
    > related to a language's constructs (i.e. something more closely related
    > to grammar) than to the spelling of its words. If you argue that a
    > misspelled word is "suboptimal" or "illegal" you would be correct in a
    > sense. But here the user's intent was to write a legal/optimal word. In
    > your case, it is the user's intent to write a correctly spelled
    > "suboptimal" word. Should some sort of warning pop up? Maybe, but that
    > could get annoying really fast if I honestly mean to use these words.

    Barbarisms are words that you use by mistake; people usually do not want to
    write their language incorrectly.

    > Because I don't see this as a spelling issue and don't believe that it
    > will only take 100 lines to get it right. Consider simply these following
    > cases. Please tell me how to fix them without creating a *huge*
    > barbarism file and how to properly identify and handle them in under 100
    > lines of code:
    >
    > *) Mixed capitalization (ComPutEr)
    > *) Different verb tenses (compute, computed, has computed)
    > *) Pluralization (computes, computers)
    > *) Split infinitives
    > *) The "barbaric" word is misspelled. You'd need to do at least 2
    > mappings here to get the intended effect: misspelled barbaric->correct
    > barbaric->preferable word
    > Note that this is just what I could think of in 30 seconds, and isn't an
    > exhaustive study of the problem at hand.
    > I see this as a separate service that we could provide in addition to
    > spell checking, but it is certainly not spell checking.

    Dom, there are not many barbarisms. We are usually talking about 100-200
    words. We are not talking about re-implementing a full spell-checking
    system. A simple list will be enough. It's true that you will have two
    entries if the word has a plural, and that only some forms of a verb may be
    barbarisms.
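
    To make the "simple list" idea concrete, here is a rough sketch in Python
    (not AbiWord code). The entries and the names BARBARISMS and suggest are
    only illustrations I am assuming for this example; a real list would hold
    the 100-200 pairs for the language. The lookup ignores case, so mixed
    capitalization does not need extra entries, while a plural simply gets its
    own entry.

```python
# Hypothetical entries for illustration only; not taken from any real word list file.
BARBARISMS = {
    "barco": "vaixell",
    "barcos": "vaixells",  # the plural is listed as its own entry
    "bueno": "bé",
}


def suggest(word):
    """Return the preferred word for a listed barbarism, or None otherwise."""
    replacement = BARBARISMS.get(word.lower())  # case-insensitive lookup
    if replacement is None:
        return None
    # Roughly carry the capitalization of the typed word over to the suggestion.
    if word.isupper():
        return replacement.upper()
    if word[:1].isupper():
        return replacement.capitalize()
    return replacement
```

    For example, suggest("Barcos") would give "Vaixells", and a word that is
    not in the list gives None, so the normal spell checker handles it as
    before.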

    > I don't think that you can achieve this through using a custom.dic for
    > every language, as the custom.dic only has a list of words you mark as
    > "allowable" or "correctly spelled" for a language. It doesn't offer a
    > mapping from wrong->correct word. It doesn't use any algorithm (eg:
    > soundex, visual similarity) to suggest words. To go through this route,
    > in my estimation, would involve writing something nearly equivalent in
    > both size and scope to ispell.

    Well, you can have a custom.dic where each line holds either one word, or
    two words joined by a special separator character that marks the pair as
    misspelled word, suggested word.
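
    As a rough sketch of how such a file could be read (in Python, not AbiWord
    code): I am assuming "|" as the separator and the names SEPARATOR and
    parse_custom_dic just for this example, since nothing is decided yet. Lines
    with a single word stay in the "allowed" set as today; lines with the
    separator become wrong word, suggested word pairs.

```python
SEPARATOR = "|"  # assumed separator character; the actual choice is still open


def parse_custom_dic(lines):
    """Split custom.dic lines into allowed words and wrong -> suggested pairs."""
    allowed = set()
    suggestions = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if SEPARATOR in line:
            # Two words on the line: the wrong word, then the suggested word.
            wrong, _, right = line.partition(SEPARATOR)
            suggestions[wrong.strip()] = right.strip()
        else:
            # One word on the line: a plain "correctly spelled" entry.
            allowed.add(line)
    return allowed, suggestions
```

    With this layout an existing custom.dic that only has one word per line
    would still parse the same way, so the format stays backward compatible.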

    > You asked if people had objections. I had one. It seems silly to
    > basically say "I'm looking for objections" and then tell me that "It's
    > too early to object, wait until we discuss more," especially since your
    > message didn't even mention the possibility of discussion. Your email
    > basically said "Here is perceived problem X. Does anyone want to stop me
    > because I'm about to implement something to fix perceived problem X."
    > The logic seems a bit flawed, at least to me...

    Dom, since you are the maintainer, if you say "I don't want this in
    the main tree," I interpret this as "that's the end of the conversation,
    keep this feature out of the main CVS."

    > Is this something useful, in my opinion? Maybe/probably. Would I object
    > to it being a plugin? Probably not. Do I still object to it being in the
    > main tree? Yup.

    If you still think that we should use a plugin, can you describe a bit more
    how you think I should go about it?

    I appreciate your comments, Dom.

    Thanks,

    -- 
    

    Jordi Mas http://www.softcatala.org


