Re: Implementing support for barbarisms correction

From: Andrew Dunbar (hippietrail@yahoo.com)
Date: Sun Sep 22 2002 - 01:32:41 EDT

     --- Dom Lachowicz <doml@appligent.com> wrote:
    > On Sat, 2002-09-21 at 11:31, Jordi Mas wrote:
    >
    > > Well, I think that we need a solution that marks
    > > the misspelled word and offers a replacement,
    > > since that is what a user will expect from a spell
    > > checker. I do not think that we can do this using
    > > ispell since it just does not work that way.
    >
    > One of my questions is "should these words be marked
    > as misspelled at all?" Words that are "incorrect"
    > but are widely in use sound like good candidates for
    > addition into a language, regardless of what the
    > French and Spanish purists/governments think.
    > Languages are fluid, evolving things and borrow
    > heavily from their times, surroundings, other
    > languages and cultures, technology/science, and the
    > people who speak the language. But let's say that
    > these words should really be marked as incorrect,
    > for the sake of argument.

    Unless the barbarisms are in the hash file they're
    going to be marked as misspelled anyway but they're
    not going to have useful replacement suggestions.
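
    A rough sketch of what I mean, not real ispell code. The word
    list is a toy stand-in for the hash file, and "email" versus
    "courriel" is only an illustrative pair. The speller flags the
    barbarism simply because it is unknown, and its suggestions are
    look-alike words, so the preferred word never comes up.

```python
import difflib

# Toy stand-in for the compiled hash file (assumption). The barbarism
# "email" is deliberately absent; the preferred "courriel" is present.
HASH_FILE_WORDS = ["courriel", "courrier", "mail", "message", "émail"]


def check(word):
    """Return (is_misspelled, suggestions) the way a plain speller would."""
    if word in HASH_FILE_WORDS:
        return False, []
    # Suggestions are ranked purely by string similarity to the input.
    close = difflib.get_close_matches(word, HASH_FILE_WORDS, n=3, cutoff=0.6)
    return True, close


flagged, suggestions = check("email")
print(flagged, suggestions)
```

    The word gets its squiggle, but "courriel" is too far from
    "email" in spelling to be offered.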

    > If they really aren't allowable words, does this
    > fall under a spell checking problem? I'd argue no.
    > Spell checking solves the problem of mapping:
    >
    > possibly misspelled word->correctly spelled word(s)
    >
    > and not
    >
    > possibly suboptimal/illegal word->better/legal
    > word(s)
    >
    > In my opinion, this looks like a different, but
    > related problem, one related to a language's
    > constructs (i.e. something more closely related
    > to grammar) than to the spelling of its words. If
    > you argue that a misspelled word is "suboptimal"
    > or "illegal" you would be correct in a sense. But
    > here the user's intent was to write a legal/optimal
    > word. In your case, it is the user's intent to write
    > a correctly spelled "suboptimal" word. Should some
    > sort of warning pop up? Maybe, but that could get
    > annoying really fast if I honestly mean to use these
    > words.
    > But I don't want to disable spell checking because I
    > really do want the rest of the words checked.
    >
    > So through all of this, what we've shown is that this
    > is possibly a proofing problem, and not a spell
    > checking problem. Read on.
    >
    > > I see this as an enhancement to the spell checking
    > > system and it most likely will take under 100
    > > lines of code. Any particular reason that makes
    > > you think that this is not appropriate for our
    > > main tree?
    >
    > Because I don't see this as a spelling issue and
    > don't believe that it will only take 100 lines to
    > get it right. Consider just the following
    > cases. Please tell me how to fix them without
    > creating a *huge* barbarism file and how to properly
    > identify and handle them in under 100 lines of code:
    >
    > *) Mixed capitalization (ComPutEr)
    > *) Different verb tenses (compute, computed, has
    > computed)
    > *) Pluralization (computes, computers)
    > *) Split infinitives
    > *) The "barbaric" word is misspelled. You'd need to
    > do at least 2 mappings here to get the intended
    > effect:
    > misspelled barbaric->correct barbaric->preferable
    > word
    >
    > Note that this is just what I could think of in 30
    > seconds, and isn't an exhaustive study of the
    > problem at hand.
    >
    > I see this as a separate service that we could
    > provide in addition to spell checking, but it is
    > certainly not spell checking.
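
    To make two of those cases concrete (mixed capitalization and
    the misspelled barbarism), here is a minimal sketch. All names
    and the word pair are hypothetical, and difflib stands in for
    whatever suggestion algorithm the real speller uses. Tenses
    and plurals are not handled at all.

```python
import difflib

# Hypothetical data: a toy dictionary that contains the barbarism,
# plus a barbarism -> preferred word table.
KNOWN_WORDS = {"email", "courriel", "message"}
PREFERRED = {"email": "courriel"}


def restore_case(original, replacement):
    """Carry over the coarse capitalization of the word the user typed."""
    if original.isupper():
        return replacement.upper()
    if original[:1].isupper():
        return replacement.capitalize()
    return replacement


def suggest(word):
    """Return the preferred word for a (possibly misspelled) barbarism."""
    key = word.lower()  # fold case so "EmAil" and "email" look alike
    if key not in KNOWN_WORDS:
        # Mapping 1: misspelled barbarism -> correctly spelled barbarism.
        matches = difflib.get_close_matches(
            key, sorted(KNOWN_WORDS), n=1, cutoff=0.7)
        if not matches:
            return None
        key = matches[0]
    # Mapping 2: correctly spelled barbarism -> preferred word.
    preferred = PREFERRED.get(key)
    return restore_case(word, preferred) if preferred else None


print(suggest("Emial"))
```

    Even this toy needs two lookups plus case handling.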

    These are the same problems I foresaw. Ispell needs
    every wordform listed. This project will also need
    every wordform listed. If the list is large it may
    even need affix compression.
    You're not going to have a decent system with 100 LOC.
    If you do it, do it properly.
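
    For the wordform point, a toy sketch of affix expansion. The
    "stem/FLAGS" notation is only loosely modelled on ispell
    dictionaries, and the three suffix rules are made up for the
    example. They are not a real affix file.

```python
# Hypothetical suffix rules: flag -> (suffix to strip, suffix to add).
AFFIX_RULES = {
    "S": ("", "s"),     # compute -> computes
    "D": ("e", "ed"),   # compute -> computed
    "G": ("e", "ing"),  # compute -> computing
}


def expand_entry(entry):
    """Expand one compressed 'stem/FLAGS' entry into its wordforms."""
    stem, _, flags = entry.partition("/")
    forms = [stem]
    for flag in flags:
        strip, add = AFFIX_RULES[flag]
        if stem.endswith(strip):
            # Drop the stripped suffix (if any), then append the new one.
            forms.append(stem[:len(stem) - len(strip)] + add)
    return forms


print(expand_entry("compute/SDG"))
```

    One compressed line stands for four wordforms here. Without
    something like this, the barbarism list has to spell out every
    form of every word by hand.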
    And I agree it's definitely not a spelling problem.
    It's a style problem. I think mixing style into the
    grammar checker is the solution, since in my
    experience grammar checkers rarely check grammar at
    all but instead check style issues such as passive
    voice and capitalization.
    If we make it part of a grammar checker, we can give
    all grammar options a separate switch as in MS Word,
    but they can all share squiggle code and squiggle
    colour.
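
    Roughly what I have in mind, as a sketch only. The rule names,
    the Squiggle record and the word pair are all hypothetical and
    nothing here is existing AbiWord code. Each rule has its own
    switch, but they all report through the same squiggle type.

```python
import re
from dataclasses import dataclass


@dataclass
class Squiggle:
    start: int      # offset of the first marked character
    end: int        # offset just past the last marked character
    rule: str       # name of the rule that fired
    message: str


PREFERRED = {"email": "courriel"}  # illustrative barbarism table


def check_barbarisms(text):
    for m in re.finditer(r"\w+", text):
        better = PREFERRED.get(m.group().lower())
        if better:
            yield Squiggle(m.start(), m.end(), "barbarisms",
                           "consider '%s'" % better)


def check_capitalization(text):
    # Lowercase letter at the start of the text or after ., ! or ?
    for m in re.finditer(r"(?:^|[.!?]\s+)([a-z])", text):
        yield Squiggle(m.start(1), m.end(1), "capitalization",
                       "sentence should start with a capital")


RULES = {"barbarisms": check_barbarisms,
         "capitalization": check_capitalization}


def check_style(text, enabled):
    """Run only the switched-on rules; all share one squiggle list."""
    found = []
    for name, rule in RULES.items():
        if enabled.get(name, False):
            found.extend(rule(text))
    return sorted(found, key=lambda s: s.start)


for s in check_style("send me an email. thanks", {"barbarisms": True}):
    print(s)
```

    Turning a rule off is just a matter of flipping its entry in
    the enabled table. The drawing code never needs to know which
    rule produced a squiggle.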

    > > Alan has suggested that we can implement this as
    > > an enhanced custom.dic for every language. It
    > > makes sense to me. What do you think?
    >
    > I don't think that you can achieve this through
    > using a custom.dic for every language, as the
    > custom.dic only has a list of words you mark as
    > "allowable" or "correctly spelled" for a language.
    > It doesn't offer a mapping from wrong->correct word.
    > It doesn't use any algorithm (eg: soundex, visual
    > similarity) to suggest words. To go through this
    > route, in my estimation, would involve writing
    > something nearly equivalent in both size and scope
    > to ispell.

    See my discussion of Aspell/Pspell.

    <snip>

    > Is this something useful, in my opinion?
    > Maybe/probably. Would I object to it being a plugin?
    > Probably not. Do I still object to it being in the
    > main tree? Yup.
    >
    > Cheers,
    > Dom
    >

    =====
    http://linguaphile.sourceforge.net/cgi-bin/translator.pl http://www.abisource.com


    This archive was generated by hypermail 2.1.4 : Sun Sep 22 2002 - 01:37:04 EDT