Re: Subject: LyX 2.0beta3: Spell Checking + Multilingualism

From: Dominic Lachowicz <domlachowicz_at_gmail.com>
Date: Tue Feb 01 2011 - 00:29:04 CET

Hi Walter,

Can you parse this out into a series of questions? I'd like to help,
but I hate reading through weeks of emails to figure out the
important bits.

Thanks,
Dom

On Mon, Jan 31, 2011 at 3:20 PM, Walter <walter.stanish@gmail.com> wrote:
>
> Hi there,
>
> Sorry for the rather random request, but would you mind reviewing (the
> bottom portion of) this email?
>
> It relates to potential scope and features for enchant (vs. hunspell,
> etc.) as they may pertain to assisting with the ongoing development of
> LyX, a LaTeX frontend available from http://www.lyx.org/
>
> It appears to me that enchant and LyX could be great allies.
>
> Therefore it would be great if you could scan through the following
> and correct any misconceptions, since it is really not my area of
> expertise and I am uncertain whether the volunteers within the LyX
> development community are familiar with the latest features of
> enchant.
>
> Thanks for your work and best wishes,
> Walter
>
> ---------- Forwarded message ----------
> From: Walter <walter.stanish@gmail.com>
> Date: 28 January 2011 08:43
> Subject: Re: Subject: LyX 2.0beta3: Spell Checking + Multilingualism
> To: LyX Users <lyx-users@lists.lyx.org>
> Cc: LyX Devel <lyx-devel@lists.lyx.org>
>
>
>>> http://www.mail-archive.com/lyx-users@lists.lyx.org/msg83713.html
>>
>> Ok, AFAIU this refers to the XeLaTeX engine and not to LyX.
>> Of course, if someone wants to develop a solid algorithm for language
>> guessing and can convince the LyX developer community of it and has the
>> resources to implement and test it - it may happen. Another option
>> would be to have a spell checker backend including this feature.
>
> OK. Sadly I am not that person, but I am very glad that this
> discussion has occurred and that at least the potential for
> improvements has been identified.
>
>>> But alas, the user is still utterly laboured with tedious repetition
>>> of language specification (also text style selection, with the hack I
>>> use), and will remain so until LyX UI changes.
>>
>> Do you have an example for such a document?
>
> I will email you immediately with a copy.
>
>>>> But it can be tricky to get it right. It heavily depends on the spell checker -
>>>> aspell, e.g., accepts completely different "alternative" language settings than
>>>> hunspell or Apple's spell checker do. And it depends on the runtime environment -
>>>> what dictionaries are available for the user on the current machine.
>>>> And we have the feature to switch between the spell checker back ends at runtime.
>>>
>>> This sounds ugly.  Is there any similarity between spell checking APIs?  Is
>>> there a cross platform, spell checking library unification / abstraction layer
>>> available? Would it be worth developing one? How difficult is it to detect
>>> known dictionaries and spell checkers on a cross-platform basis?
>>
>> I'll cite my own investigation into the similarity between spell checking APIs.
>> The focus was the management of personal word lists.
>>
>>> We have support for different spell checker backends.
>>> All of them are able to check words, of course.
>>> But the capabilities with personal word lists differ horribly.
>>> The following table presents the results of my investigation.
>>>
>>> Feature     | aspell | native (mac) | enchant | hunspell
>>> ========================================================
>>> check       | +      | +            | +       | +
>>> suggest     | +      | +            | +       | +
>>> accept      | +      | +            | +       | +
>>> insert      | +      | +            | o (2)   | o (3)
>>> ispersonal? | o (1)  | +            | -       | -
>>> remove      | -      | +            | + (4)   | -
>>>
>>> Legend:
>>> + feature is supported
>>> - feature is not supported
>>> o there are limitations:
>>> 1) aspell has an interface to enumerate the personal word list,
>>>   so it's possible to implement; I have a patch for LyX at hand.
>>> 2) Versions below 1.6.0 truncate the personal word list on open -
>>>   effectively no personal word list is available after a restart.
>>> 3) There is no persistent state for personal word lists.
>>> 4) Enchant manages its own personal word lists.
>
> Thanks for this very interesting review.
>
> Points (2) and (3), plus the lack of a remove feature in some
> engines, seem like the sorts of things that upstream bug reports
> could fix.  In addition, (3) could be worked around by LyX pretty
> easily (though such code is always best avoided; a sketch follows),
> and both (2) and the missing remove functionality seem non-critical
> (just annoying).
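>
> As a minimal sketch of that workaround for (3) - assuming hunspell's
> C++ API, where Hunspell::add() affects the running session only - LyX
> could keep its own plain-text word list and replay it at startup (the
> file layout and function names here are hypothetical):
>
>   // Re-feed a personal word list into a hunspell session at startup,
>   // since added words are otherwise forgotten on restart.
>   #include <hunspell/hunspell.hxx>
>   #include <fstream>
>   #include <string>
>
>   // Load every word from `path` into the running checker's session.
>   void load_personal_words(Hunspell & checker, std::string const & path)
>   {
>       std::ifstream in(path.c_str());
>       std::string word;
>       while (std::getline(in, word))
>           if (!word.empty())
>               checker.add(word.c_str());  // runtime-only: lost on exit
>   }
>
>   // Add a word to the session and to our own persistent list.
>   void add_personal_word(Hunspell & checker, std::string const & path,
>                          std::string const & word)
>   {
>       checker.add(word.c_str());
>       std::ofstream out(path.c_str(), std::ios::app);
>       out << word << '\n';
>   }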
>
> Could you please clarify the purpose of 'ispersonal?' at the
> LyX UI level?  I'm guessing it's a feature to test whether a word
> came from a "standard" dictionary or from a "personal" dictionary,
> but I am unsure whether this distinction is useful at all for LyX.
>
>>> There is already some discussion on the net about consolidating
>>> spell checking for the whole desktop:
>>> https://wiki.ubuntu.com/ConsolidateSpellingLibs
>>> I don't know how long it would take to get a result.
>
> This seems to be a move to push Ubuntu towards hunspell
> and enchant.  hunspell is claimed to be the most modern
> implementation of a multilingual spell checking backend.
>
> However, its language support is not complete, so users of
> some languages still require support for other, specialised
> engines (Voikko for Finnish, Zemberek for Turkish, Uspell for
> Yiddish, Hebrew, and Eastern European languages, Hspell
> for Hebrew) which enchant can provide.
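>
> (For what it's worth, enchant's per-language backend selection appears
> to be driven by a plain-text enchant.ordering file; a sketch, with the
> provider names depending on the enchant version and what is installed:
>
>   # enchant.ordering - preferred provider(s) per language tag
>   *:aspell,myspell        # default order for all other languages
>   fi:voikko               # Finnish via Voikko
>   he:hspell               # Hebrew via Hspell
>
> so LyX could in principle delegate per-language engine choices to it.)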
>
> Contrary to your prior email, it seems that enchant does
> work on OSX, since support for AppleSpell (Mac OSX) is
> claimed @ http://www.abisource.com/projects/enchant/
>
> (Note: Ubuntu cites the lack of upstream hunspell or enchant support
> in some major software packages, such as PHP, as a problem preventing
> older implementations from being dropped. GTKSpell and KDE already
> use enchant.)
>
>>>>> 3. Wider problem of spellchecking and multilingual support
>>>>>   -------------------------------------------------------
>>>>>   Regarding points 1 and 2, really there is a wider problem of multilingual
>>>>>   support being a little 'all over the place', with a bunch of different
>>>>>   "solutions" in use.  In terms of LyX, none of these are really "solutions"
>>>>>   as even with LyX 2.0beta1 it appears to be demonstrably impossible to link
>>>>>   the manual language markup made in conjunction with a font-linked solution
>>>>>   to the manual language markup required for spellchecking purposes.
>>>>
>>>> Sorry, I cannot follow you. The language you assign to a word, phrase or paragraph
>>>> is used for spell checking of the words in that area. Do you refer to the
>>>> fact that it's possible to mark two parts of a word with two languages?
>>>
>>> Sorry for my lack of clarity.  Let me try again.
>>>
>>> Right now there are three concepts, none of them really linked by LyX
>>> without customisation:
>>> - text style
>>> - language
>>> - font
>>>
>>> Only partial solutions exist for relating these together.
>>>
>>> None of them seem to provide for particularly good user experience.
>>>
>>> IMHO the number of hacks developed in this area shows clear community
>>> frustration with the status quo.
>>>
>>> In short, whilst LyX is a great tool, it occurs to me that in areas like
>>> this it could go so much further by tackling some of "the hard
>>> problems", such as complex multilingual use cases.
>>
>> In recent years, I have every now and then read mails proposing the elimination
>> of manual text markup - or at least hiding it better... but it didn't happen.
>>
>> People want to have the text styles to be able to "finger paint" the text.
>
> I can see the requirement for this; in linguistics, for example, authors
> quite frequently wish to show multiple forms of the same glyph in
> different fonts.  That's a normal requirement (and indeed a design-level
> assumption of unicode and other character sets, which deliberately leave
> such presentation distinctions to higher layers rather than solving them
> in the encoding...)
>
> An elegant solution for multilingual text markup would be one that is entirely
> separate from 'text style' blocks.  Marking up language should be just
> that, and spell checking and (La/Xe)TeX-layer (babel) hyphenation should
> run exclusively against this information.
>
> (Unfortunately, I have so far found no way to make this happen with
> LyX-2.0.0beta3 - please see the hacks and manual markup used in the
> privately supplied sample document.)
>
>>>>>   As per previous posts whereby I suggested revising the user interface to
>>>>>   make proper use of available databases and let the user assign fonts
>>>>>   to unicode blocks and/or languages and/or custom defined text-types for
>>>>>   font selection purposes, a forward-looking, integrated solution should
>>>>>   also take in to account spellchecker requirements.
>>>>>
>>>>>   Otherwise, we poor users are laboured with having to make 1000 manual
>>>>>   markups just to include a short bit of text!
>>
>> This sounds a bit exaggerated...
>
> Please see the document supplied. Really, it is tedious to mention a place or
> person in romanised, native and translated forms when there are so many
> steps required to mark each portion up for spellcheckers + hyphenation
> (language) and correct font selection (text style, with the hack I am
> using - shared by another user the previous time I posted on multilingual
> issues, and still the only method I have found to achieve the output I
> am looking for).
>
>>>>>   This is exemplified if,
>>>>>   for instance, one wishes to quote a place name with translations and their
>>>>>   romanised equivalents in situ at many points throughout a document
>>>>>   (my unfortunate situation, and before anyone asks: no I cannot switch to
>>>>>   compiling a reference table, for reasons of readership and readability)
>>
>> You may copy the place names to many points in your document.
>
> The personal and geographic names are numerous and differ throughout the
> document.  Please see relevant portions of the sample document I have sent
> privately.  Their number will grow significantly as the text develops.
>
>>>>> 4. Weird behaviour with common prefixes and specialist compounds [X]
>>>>>   -----------------------------------------------------------------
>>>>>   Common prefixes such as micro and proto seem to confuse aspell.  Not sure
>>>>>   if this is somehow related to how it is linked from LyX, but I assume the
>>>>>   issue lies with aspell itself.  For example, 'proto-<known word>' does not
>>>>>   seem to be accepted, forcing 'proto' to be added manually as a valid word.
>>>>>   Unfortunately, the LyX interface does not offer a proper workaround.
>>>>>   (Please see point 5.)
>>>>>   (Note: Upon further investigation, actually a lot of words appear to be
>>>>>    missing from the default dictionary, including "hewn", "proven",
>>>>>    "romanised". A scrabble player would be dismayed: for many points!)
>>>>>   (PS: Did anyone ever wonder about the etymology of 'hardscrabble'? I think
>>>>>    aspell's default English dictionary could be involved in at least one
>>>>>    definition...)
>>>>
>>>> This one I have to investigate, cannot comment on this now.
>>>>
>>>> But, AFAIK there is no default aspell dictionary. It depends on the
>>>> software packager what gets distributed. You may have an installation
>>>> with a German dictionary only. And there are different English dictionaries
>>>> available...
>>>>
>>>> This is what my aspell installation has to offer for English:
>>>> * en, en-w_accents, en-wo_accents
>>>> * en-variant_0, en-variant_1, en-variant_2
>>>> * en_CA, en_CA-w_accents, en_CA-wo_accents
>>>> * en_GB, en_GB-w_accents, en_GB-wo_accents
>>>> * en_GB-ise, en_GB-ise-w_accents, en_GB-ise-wo_accents
>>>> * en_GB-ize, en_GB-ize-w_accents, en_GB-ize-wo_accents
>>>> * en_US, en_US-w_accents, en_US-wo_accents
>>>>
>>>> Some of them are combined dictionaries.
>>>
>>> Perhaps a summary could be made available of dictionary contents,
>>> either through built-in descriptions and/or the proposed pan-spellchecker-
>>> engine abstraction/unification library, hmmm?
>>
>> Perhaps. Currently I don't know of a usable pan-spellchecker-engine.
>
> If the resources and experience of programmers on much larger projects
> are anything to judge by, then a/ispell seems to be pretty much obsolete
> and in the process of becoming a historical footnote, with hunspell
> greatly promoted in its stead.
>
> Your previous email cites missing OSX support (apparently since remedied)
> and the lack of "language variety support" as reasons not to use enchant.
>
> Could you please explain the second point more clearly?
>
> (If there is really a problem, it seems to me that in seeking to become a
> dominant, unified engine, the enchant project developers would be very
> keen to hear of any needs that are not met by their API/ABI...)
>
>>>> Another option is to switch to hunspell and use the openoffice dictionaries.
>>>> It is said that these dictionaries are superior.
>>>
>>> Thanks, I will try it. An excellent tip!
>>> (Of course it would be better if the UI suggested this or even
>>> detected availability...)
>
> After reading up further, I can see the problem with this idea (hunspell
> lacks language support that enchant offers) and would instead like to first
> explore any limitations of LyX's apparent enchant support.
>
>> If you cannot see it in your UI you cannot use it. Then it's not available.
>> BTW, what version of LyX are you using?
>
> LyX-2.0.0beta3, recently upgraded from LyX-2.0.0beta1 on Gentoo Linux.
>
>>>>> 5. Right sidebar spellchecker interface: word addition [*]
>>>>>   -------------------------------------------------------
>>>>>   At various points throughout my document I use accepted phrases within
>>>>>   the sphere of my writing such as "Proto-Austro-Tai" and "Tai-Kadai".
>>>>>
>>>>>   Whilst "Tai" and "Kadai" are also used as individual words, "Proto" and
>>>>>   "Austro" are not.  With the present spellchecker interface, when such
>>>>>   'word portions' occur, I am only given two options:
>>>>>
>>>>>    1. Adding these 'word portions' as words in their own right
>>>>>    2. Ignoring them as words in their own right
>>>>>
>>>>>   Both options are less than ideal because they will subsequently allow
>>>>>   the individual words to occur alone, ie: such that human input could
>>>>>   conceivably render "Come hither, pronto!" as "Come hither, proto!" and
>>>>>   the spellchecker would consider this to be correct, despite the fact
>>>>>   that proto should possibly not occur as a word in its own right.
>>>>>   (OK well that's probably arguable, but you still see the point!)
>>>>>
>>>>>   The best option for resolving this would be to modify the LyX spellchecker
>>>>>   sidebar interface to allow adding arbitrary or entire words, rather
>>>>>   than merely the word portions that aspell has identified as
>>>>>   unknown.  (ie: When "Proto" is highlighted within "Proto-Austro-Tai",
>>>>>   the user should be able to add "Proto-Austro-Tai" as a word in its
>>>>>   own right rather than only the 'word portion' "Proto" itself.)
>>>>>   (If I recall, 'other' word processing solutions include this feature.)
>>>>
>>>> Here LyX relies on the spell checker interface. Most checkers are able to do
>>>> the checks at word level only. Consequently you cannot add compound words to
>>>> your personal dictionary, AFAIK. Here I want to wait for an improvement of the
>>>> spell checker libraries. It's possible to check complete sentences -
>>>> the Apple spell checker has this capability. It's even able to auto-detect
>>>> the language...
>>>
>>> (Well "they" do say that Apple is very good at usability, and that open source
>>> generally isn't.  Perhaps "they" are correct in this assertion, sometimes...)
>>
>> Apple is using the hunspell spell checker engine internally for the spell service
>> of the OS (since Snow Leopard). But obviously they added some code to make the
>> grammar checking and multi-language detection possible.
>
> Ooh, interesting!
>
> Which libraries support grammar checking? (It was not mentioned in your
> research summary table.)  A grammar checking feature seems a 'bonus'
> rather than a requirement though, to me at least. Spell checking seems
> much more important.  (In my experience, grammar checking in 'other'
> word processing solutions has tended to be a pain rather than a
> pleasure, at least in one's native language.)
>
> On the second point of multi-language detection...
>
>> The spell checker on the Mac
>> does what you want, AFAIK. It uses heuristics to choose the language if you don't
>> provide the information. But then - I have seen it in Apple's Mail tool - a Spanish
>> word in a French sentence will be marked as misspelled. Perhaps you would be
>> disappointed again by being forced to mark that Spanish word manually.
>
> For languages with a common character set and similar vocabularies, this is
> always going to be a requirement.  Certainly, the detection of language is often
> possible at the phrase level; at the level of a single word, however, it is
> sometimes impossible.
>
> Notwithstanding the above, when preparing documents such as mine that feature
> many short phrases in languages that are unicode-block-distinct (Chinese, Thai,
> Lao, Arabic, Mongolian, Greek, Yi, etc.), it seems a ridiculous UI proposition
> not to support some form of automated language markup at the time of input,
> since the vast majority of language changes in such a document are obvious
> and continue for more than the length of a single word.
>
>> I've heard they returned some contribution to the community (OpenSpell)...
>
> I could not find any information on this, only:
> http://openxspell.sourceforge.net/
>
> In any case, the real question seems to be: "What is the best way to
> implement language detection, for the purposes of preventing the need to
> finger paint language markup and thus enhance ease of use where multiple
> languages are frequently mixed within a single document?"
>
> (For the purposes of this discussion, let's define 'best' as "relatively easy
> to implement" and "forward looking" (ie: not going to cause issues) from
> a developer perspective, and "relieves as wide a user base as possible
> of tedium" from a user perspective.)
>
> Leaving aside for a moment the other annoyances such as font, TeX
> backend and spell-checking backend selection in such multilingual
> documents, and working on the dual assumptions that "it is
> necessary at a first stage to mark up the language of a piece of
> text", and "tedium in this process is to be avoided", then it seems
> to me that there are two approaches available:
>  (1) Dictionary-based language detection
>  (2) Unicode code block change based language detection
>
> Both methods are imperfect and will still require some finger-painting,
> however any automation will be of great use to the user.
>
> Regarding method (1), which Apple has apparently used throughout OSX
> (Google also provides a "detect language" feature on its Google Translate
> service), there are some inherent limitations.
>  (1A) [Missing dictionary]: A dictionary must be available in the
> language in question.
>  (1B) [Ambiguity]: Certain short phrases may exist in multiple
> languages and thus prevent reliable detection.
>  (1C) [Many languages]: CPU overhead may be high in the case that many
> languages are in concurrent use and dictionary-based detection, ie:
> method (1), is the initial, sole and primary means of language
> detection.
>
> In short, method (1) appears to be best suited to those environments
> in which monolingual text is being prepared in a popular language for
> which dictionaries exist, and where the text's length exceeds a single
> word or phrase.  As a large part of LyX's community seems to be either
> academic or technical, they are less likely than a "general user base"
> to be limited to the production of such documents.  In particular:
>  - The use of pre-modern, extinct or obscure languages or historical
> language forms will ensure that dictionaries are missing relatively
> frequently.
>  - It is highly likely that multiple languages are present in a single document.
>  - It is relatively likely that many languages are present in a single document.
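>
> To make method (1) concrete, here is a minimal sketch over enchant's
> C API; tokenisation and language tags are simplified, and this is a
> sketch rather than a proposal for the actual LyX code:
>
>   // Naive method (1): score a phrase against several enchant
>   // dictionaries and return the language tag with the most hits.
>   #include <enchant.h>
>   #include <sstream>
>   #include <string>
>   #include <vector>
>
>   std::string guess_language(std::string const & phrase,
>                              std::vector<std::string> const & candidates)
>   {
>       EnchantBroker * broker = enchant_broker_init();
>       std::string best;
>       int best_hits = -1;
>       for (size_t i = 0; i < candidates.size(); ++i) {
>           EnchantDict * dict =
>               enchant_broker_request_dict(broker, candidates[i].c_str());
>           if (!dict)
>               continue;  // limitation (1A): dictionary missing
>           int hits = 0;
>           std::istringstream words(phrase);
>           std::string w;
>           while (words >> w)
>               if (enchant_dict_check(dict, w.c_str(), w.size()) == 0)
>                   ++hits;  // a zero return means "word found"
>           if (hits > best_hits) {
>               best_hits = hits;
>               best = candidates[i];
>           }
>           enchant_broker_free_dict(broker, dict);
>       }
>       enchant_broker_free(broker);
>       return best;  // near-ties illustrate limitation (1B)
>   }
>
> (CPU cost grows with the candidate list, which is limitation (1C).)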
>
> Let's now look quickly at method (2), unicode code-block based
> detection.  This is the method that I originally proposed to the list
> at http://www.mail-archive.com/lyx-users@lists.lyx.org/msg83635.html
> (attachment: http://pratyeka.org/unicode-font-mockup.png), which you
> said you personally lacked the time/motivation to read, but which I
> feel I must reference again.
>
> Under this method, no dictionary-based detection is required.  The
> user is given the ability to auto-associate text input under a certain
> unicode codeblock with a certain language, and input is automatically
> marked up.  Whilst multiple languages per codeblock are supported,
> they would require either finger-painting or a fallback to (1) (see
> below for notes on combined methods).
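>
> A sketch of the core of method (2) follows - the block ranges are
> standard unicode blocks, but the language assignments shown are purely
> illustrative; a real implementation would read them from the user's
> per-document configuration:
>
>   // Naive method (2): map a unicode code point to the user's
>   // configured default language for its code block.
>   #include <cstddef>
>   #include <string>
>
>   std::string language_for(unsigned int codepoint)
>   {
>       static struct { unsigned int lo, hi; char const * lang; }
>       const table[] = {
>           { 0x0370, 0x03FF, "el" },     // Greek and Coptic
>           { 0x0590, 0x05FF, "he" },     // Hebrew
>           { 0x0E00, 0x0E7F, "th" },     // Thai
>           { 0x0E80, 0x0EFF, "lo" },     // Lao
>           { 0x4E00, 0x9FFF, "zh_CN" },  // CJK Unified Ideographs
>       };
>       for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); ++i)
>           if (codepoint >= table[i].lo && codepoint <= table[i].hi)
>               return table[i].lang;
>       return "";  // unmapped block: finger-paint or fall back to (1)
>   }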
>
> Having reviewed the two methods briefly, it is obvious that there is
> also potential to combine methods (1) and (2) to resolve implementation
> challenges (in particular, the plain inefficiency of method (1) in
> many-language deployments). A simple combined technique would be to
> use (2) to limit the dictionaries searched to the languages present in
> the unicode codeblock in question, then use method (1) to perform the
> final language detection.
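>
> Combining the two sketches above (and reusing their hypothetical
> function names), the hybrid could be as small as:
>
>   // Hybrid sketch: narrow the candidates by code block first (method
>   // (2)), then let dictionary scoring (method (1)) pick among them.
>   // languages_for_block() is hypothetical: it would return every
>   // language the user has associated with the code point's block.
>   std::string detect(std::string const & phrase,
>                      unsigned int first_codepoint)
>   {
>       std::vector<std::string> candidates =
>           languages_for_block(first_codepoint);
>       if (candidates.size() == 1)
>           return candidates[0];  // block is unambiguous
>       return guess_language(phrase, candidates);  // method (1) fallback
>   }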
>
> Other techniques such as statistical analysis are also possible,
> though far beyond the scope of LyX. Even a database matching
> dictionary languages to unicode blocks, however, almost certainly does
> not belong in LyX. (Bloat, "do one thing and do it well", KISS, etc.)
> Thus in my original email (should you be able to find the time to
> read it!) I suggested a prescriptive UI whereby the user is able to
> define their own expectations of which languages exist per unicode
> block within their document.  This also provides the side benefit of
> efficiency, since the 'primary language' (the default language to which
> input in that unicode block automatically sets language markup)
> could also be the 'primary dictionary' for language detection (eg: when
> pasting large blocks of text, opening legacy files, falling back to
> (1) from (2) as described above, or whatever...).
>
> Finally, some left-field techniques may also be available, like making
> API calls out to Google Translate for 'outsourced' language detection!
>
>>>>> 6. Dictionary Re-Use Support [*]
>>>>>   -----------------------------
>>>>>   Another point is that of re-use.  Which is to say that, when someone uses
>>>>>   for example 'BibTeX' to compile a bibliographic database, that database
>>>>>   may easily be used with other projects and is considered portable.  So
>>>>>   for all physics papers I can use one bibliography, and I may have another
>>>>>   for history papers.  Whilst this is presently handled adequately by LyX,
>>>>>   the equivalent functionality is not present for dictionary databases.
>>>>>   It should be.  This means both adding a 'manage multiple dictionaries in
>>>>>   this project' feature-set, and adding a 'which dictionary do you want to
>>>>>   add the word to' drop-down in the right hand spellchecker sidebar.
>>>>
>>>> This is a good idea (already mentioned on developers list, AFAICR).
>>>> The idea is to incorporate a personal dictionary into the document.
>>>> But it definitely will not happen tomorrow.
>>>
>>> Great, as long as the personal dictionary "in" the document is saved "outside"
>>> the document and as a file that can be:
>>> a) shared between multiple documents
>>> b) used with zero or more additional personal dictionaries within a single file
>>> c) identified as the target dictionary (vs. other active personal
>>> dictionaries) when spell-checking the document and adding words to
>>> personal dictionaries
>>
>> The externally saved personal dictionary is shared between multiple documents per se.
>
> I am unclear on the suggested implementation; however, any assumption
> that all documents a user is editing should use the same personal
> dictionary is flawed.  From your next comment, I think you see this...
>
>> My idea was to provide the dictionary "inside" the document as an alternative.
>> If you are sure a word is correctly spelled you want to transfer this know-how
>> with the document.
>
> This is a good idea, however it is not nearly as flexible as the established and
> proven BibTeX approach, ie: the storage of dictionaries in external dictionary
> files that can be linked to documents on a many-to-many basis. This
> approach facilitates control for the user ("my physics dictionary", "my
> history dictionary", "my computer science dictionary") without making
> unnecessary assumptions about the similarity of ALL documents that a
> user edits or the number of such domains in which a user actively writes.
>
>> To have multiple private dictionaries is a third option with - like the second one -
>> a much higher demand on the usability of the spell checker interface.
>
> This is my suggestion and the only way I can see LyX providing a
> competent and future-proof solution for spell checking.
>
> Did your research reveal which interface(s) did or did not support
> this approach?
>
> This seems a legitimate need that should be pushed upstream if unavailable
> in hunspell and enchant.  Enchant seems the sort of project where
> implementing a costly emulation layer for backends missing this
> support might happen "for free", at least in terms of the LyX project's
> developer time, if a request is acknowledged.  The motivation for enchant's
> developers is that this would further differentiate enchant's API/ABI from
> hunspell's in the eyes of application developers, since hunspell seems to
> be rapidly subsuming prior systems and the value of the enchant layer is
> therefore decreasing in some development scenarios.
>
> In addition, enchant probably already has code to read/write external
> dictionary files of the sort suggested ("my <x-domain> dictionary"...) and
> could expose an API/ABI for doing so that LyX could easily utilise. Any
> functionality identified as missing may be enthusiastically implemented
> by the enchant developers.
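>
> In fact, as far as I can tell enchant's C API already exposes exactly
> this: enchant_broker_request_pwl_dict() wraps a plain-text word-list
> file as a dictionary.  A minimal sketch of per-domain word lists along
> those lines (the file names are hypothetical):
>
>   // One enchant personal word list per subject domain, checked
>   // alongside the main language dictionary.
>   #include <enchant.h>
>   #include <string>
>   #include <vector>
>
>   bool known_word(std::string const & word,
>                   EnchantDict * lang_dict,
>                   std::vector<EnchantDict *> const & domain_dicts)
>   {
>       if (enchant_dict_check(lang_dict, word.c_str(), word.size()) == 0)
>           return true;
>       for (size_t i = 0; i < domain_dicts.size(); ++i)
>           if (enchant_dict_check(domain_dicts[i],
>                                  word.c_str(), word.size()) == 0)
>               return true;
>       return false;
>   }
>
>   // Usage sketch - a "physics" word list living next to the document:
>   //   EnchantBroker * broker = enchant_broker_init();
>   //   EnchantDict * en = enchant_broker_request_dict(broker, "en_GB");
>   //   EnchantDict * phys =
>   //       enchant_broker_request_pwl_dict(broker, "physics.pwl");
>   //   // adding persists in the file (older enchant releases spell
>   //   // this enchant_dict_add_to_pwl):
>   //   enchant_dict_add(phys, "eigenstate", 10);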
>
> Hell, they might even extend their scope slightly to throw in some
> critical functions for language detection!  (Indeed, on the face of it
> there would seem to be little point in adding any such code to LyX
> rather than to enchant, a cross-platform library that already provides
> concurrent access to multiple spellchecking engines' dictionaries.
> Do one thing and do it well, the KISS principle?)
>
> Sorry I do not have time to go back and snip all of this; I have been typing
> and researching the above since before dawn (sunrise over Los
> Angeles is surprisingly beautiful!) and am now running late for work!
>
> Very keen to hear thoughts...
>
> - Walter
>

-- 
"I like to pay taxes. With them, I buy civilization." --  Oliver Wendell Holmes