Re: Hebrew spellcheck

From: Andrew Dunbar (hippietrail@yahoo.com)
Date: Sun Sep 08 2002 - 22:24:16 EDT


     --- Uri David Akavia <uridavid@netvision.net.il>
    wrote:
    > Shalom.

    Shalom.

    > I sent your message to the ivrix list. I'm sorry it
    > took so long.

    No problem. I'm CC'ing the AbiWord dev list with this
    answer since a couple of people there may have
    something to add.

    > Since the problem came up with spell checking, why
    > do you need these characters for it at all?

    I'm sorry I don't have the original thread of the
    conversation around since I'm out of inbox space ):

    > Hebrew has two form of writing (I think you probably
    > know this)
    > KTIV MALEH (full writing) in which letters replace
    > the marks
    > KTIV HASER (lacking writing) in which there are no
    > special letters

    Yes, I'm aware of this. In fact I think there are more
    than two ways. I've read about this in a history of
    Hebrew at my library. Originally there were no vowels
    at all, then later, yod, vav, aleph, and he began to
    be used to represent vowels with special rules.
    They're usually called matres lectionis in English.
    Then later still, the vowel points were invented and
    I believe these are used in combination with the
    matres lectionis.

    > It is possible just to choose one (I prefer KTIV
    > MALEH), which I believe is correct if you don't
    > write the marks, which most people don't.
    > Whatever is decided should just be written in the
    > documentation somewhere. Besides, these marks have
    > absolutely no value in spellchecking - no one
    > actually checks them for correctness (it is much
    > harder than checking spelling, since it has rules).
    > So it is not a problem when you don't check them in
    > the spellchecker.

    Well I wish it was that simple. The problem is that
    people do use them and will continue to use them.
    Religious texts always seem to use them. This is an
    important case for Hebrew and we do already have users
    doing Biblical work in Hebrew with AbiWord.
    Now if some text is marked as being Hebrew and it does
    have vowel marks, they simply won't match the entries
    in the dictionary at all. So they'll all be marked as
    errors! The next step would be for us to filter out
    all the vowel points before passing words to the
    spell-checker. But now imagine we are editing a
    section of Genesis which has full vowel points and
    also some spelling errors. The spellchecker will tell
    you the word has errors and offer some suggestions.
    But all the suggestions will have no vowel points!
    Perhaps the user will be able to fix it, perhaps not.
    But the computer is a machine and ought to be able to
    do exactly this type of work for us.
    Also, the very reason the points are used in religious
    works is to remove any ambiguity raised when two words
    have the same consonants but different vowels.
    A user would expect that if we support Hebrew we
    support this. But when she has words with correct
    consonants in the correct order but with incorrect
    vowel points, no error will be shown and the user will
    be led to believe she has made no errors.
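
    To make the filtering step concrete, here is a rough
    Python sketch. Python is only for illustration, and the
    name strip_points is made up; it is not an AbiWord
    function. It assumes the points and accents are exactly
    the combining marks in the range U+0591 to U+05C7, and
    it shows how two differently pointed words collapse to
    the same string once filtered:

```python
import unicodedata

def strip_points(word: str) -> str:
    """Drop Hebrew points and accents, keeping the base letters."""
    # NFD splits precomposed presentation forms (such as U+FB2A, shin
    # with shin dot) into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(
        ch for ch in decomposed
        # Assumption: every point or accent is a combining mark in
        # the range U+0591..U+05C7; all other characters are kept.
        if not ("\u0591" <= ch <= "\u05c7" and unicodedata.combining(ch))
    )

sefer = "\u05e1\u05b5\u05e4\u05b6\u05e8"  # samekh-pe-resh pointed as "sefer" (book)
safar = "\u05e1\u05b8\u05e4\u05b7\u05e8"  # same consonants pointed as "safar" (he counted)

print(strip_points(sefer) == strip_points(safar))  # prints True
```

    Since both words reduce to the same consonants, a
    checker that only ever sees the stripped form cannot
    flag wrong points or suggest pointed corrections.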

    So the next step is to think, well I guess the Bible
    is pretty important so maybe we should just have two
    dictionaries, one without vowels that we can do now
    and start using right away for most things, then
    Biblical Hebrew can be treated as a separate language
    with its own dictionary made by whoever needs to use
    such a thing. Problem is, we're trying to stick to
    standards so we use ISO 639 language codes to mark
    sections of our documents as to which language they
    belong to. ISO 639 can be a bit vague. It does now
    have separate codes for Modern Greek (ell or gre), and
    Ancient Greek (grc); but it still has only one code
    for Hebrew (heb).
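
    One way around the single code might be to pair the
    ISO 639 code with an extra variant label when choosing
    a dictionary. The Python sketch below is only an
    illustration of that idea: the "pointed" label, the
    file names and pick_dictionary are all invented here
    and are not part of ISO 639 or of AbiWord.

```python
from typing import Optional

# Hypothetical registry: (ISO 639 code, variant label) -> word-list file.
# The variant labels and file names are assumptions for illustration.
DICTIONARIES = {
    ("heb", None): "hebrew-unpointed.txt",
    ("heb", "pointed"): "hebrew-pointed.txt",
    ("ell", None): "greek-modern.txt",
    ("grc", None): "greek-ancient.txt",
}

def pick_dictionary(code: str, variant: Optional[str] = None) -> Optional[str]:
    """Return the word list for a code/variant pair, or None if unknown."""
    exact = DICTIONARIES.get((code, variant))
    if exact is not None:
        return exact
    # Unknown variant: fall back to the plain entry for the language.
    return DICTIONARIES.get((code, None))

print(pick_dictionary("heb", "pointed"))  # prints hebrew-pointed.txt
print(pick_dictionary("heb"))             # prints hebrew-unpointed.txt
```

    The fallback means text tagged with a variant nobody
    has built a dictionary for is still checked against
    the plain word list rather than not checked at all.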

    Also, I collect foreign novels and the only one I have
    in Hebrew, Memoirs of a Geisha by Arthur Golden, seems
    to have at least one word every 5 pages or so which
    uses vowel points. If AbiWord is to be a
    professional-quality word processor, it needs to be
    good enough for the translators to have used it to
    create this book.

    I created a bug report for AbiWord some time ago
    suggesting that we need more flexibility in our use
    of language codes so you might care to look into that:
    http://bugzilla.abisource.com/show_bug.cgi?id=3227

    It might seem like I'm fighting against Hebrew spell-
    checking but I'm really not. I just want to do it
    right - and it is doable. I'd love to implement it
    myself.

    What we can do is start building up a high quality
    Hebrew wordlist. Probably as a plain text UTF-8
    encoded file. We can start with just the vowelless
    words and add the vowelled versions later.
    But we really shouldn't lock in place a system which
    isn't going to be flexible enough in the long-term.
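
    As a rough illustration of how one list could serve
    both stages, the Python sketch below indexes every
    entry by its consonants. All the names (skeleton,
    build_index, check) are hypothetical, and the policy of
    accepting a pointed word when no pointed entries exist
    yet is just one possible choice. The lines would
    normally come from a file opened with
    encoding="utf-8".

```python
import unicodedata
from collections import defaultdict

def skeleton(word):
    """Consonant-only form: decompose, then drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_index(lines):
    """Map each consonant skeleton to the set of listed spellings."""
    index = defaultdict(set)
    for line in lines:
        word = unicodedata.normalize("NFD", line.strip())
        if word:
            index[skeleton(word)].add(word)
    return index

def check(word, index):
    """Return (ok, suggestions) for a pointed or unpointed word."""
    word = unicodedata.normalize("NFD", word)
    key = skeleton(word)
    if key not in index:
        return False, []          # consonants unknown; no suggestions here
    if word == key:
        return True, []           # unpointed word with known consonants
    pointed = index[key] - {key}
    if not pointed or word in pointed:
        return True, []           # no pointed data yet, or an exact match
    return False, sorted(pointed) # wrong points: offer the pointed forms

bare = "\u05e1\u05e4\u05e8"                # samekh-pe-resh, unpointed
sefer = "\u05e1\u05b5\u05e4\u05b6\u05e8"   # pointed as "sefer"
wrong = "\u05e1\u05b4\u05e4\u05b4\u05e8"   # same consonants, other points

early = build_index([bare])                # vowelless list only
later = build_index([bare, sefer])         # pointed form added later
print(check(wrong, early)[0])  # prints True
print(check(wrong, later)[0])  # prints False
```

    With only vowelless entries the pointed word passes on
    its consonants; once pointed entries are added, the
    same list starts catching wrong points without any
    change to the file format.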

    If you think building a word-list is a good idea we
    might be able to give it a place in AbiWord's CVS
    somewhere. Perhaps creating a special project just
    for this on SourceForge is a better idea.

    Hope this helps.
    Andrew Dunbar.

    > Yours,
    >
    > Uri David
    >

    =====
    http://linguaphile.sourceforge.net/cgi-bin/translator.pl http://www.abisource.com




    This archive was generated by hypermail 2.1.4 : Sun Sep 08 2002 - 22:27:21 EDT