Fwd: LyX 2.0beta3: Spell Checking + Multilingualism

From: Walter <walter.stanish@gmail.com>
Date: Mon Jan 31 2011 - 21:20:21 CET

Hi there,

Sorry for the rather random request, but would you mind reviewing (the
bottom portion of) this email?

It relates to the potential scope and features of enchant (vs. hunspell,
etc.) as they may pertain to assisting the ongoing development of
LyX, a LaTeX frontend available from http://www.lyx.org/.

It appears to me that enchant and LyX could be great allies.

Therefore it would be great if you could scan through the following
and correct any misconceptions, since it is really not my area of
expertise and I am uncertain whether the volunteers within the LyX
development community are familiar with the latest features of
enchant.

Thanks for your work and best wishes,
Walter

---------- Forwarded message ----------
From: Walter <walter.stanish@gmail.com>
Date: 28 January 2011 08:43
Subject: Re: Subject: LyX 2.0beta3: Spell Checking + Multilingualism
To: LyX Users <lyx-users@lists.lyx.org>
Cc: LyX Devel <lyx-devel@lists.lyx.org>

>> http://www.mail-archive.com/lyx-users@lists.lyx.org/msg83713.html
>
> Ok, AFAIU this refers to the XeLaTeX engine and not to LyX.
> Of course, if someone wants to develop a solid algorithm for language
> guessing and can convince the LyX developer community of it and has the
> resources to implement and test it - it may happen. Another option
> would be to have a spell checker backend including this feature.

OK. Sadly I am not that person, but I am very glad that this
discussion has occurred and that at least the potential for
improvement has been identified.

>> But alas, the user is still utterly laboured with tedious repetition
>> of language specification (also text style selection, with the hack i
>> use), and will remain so until LyX UI changes.
>
> Do you have an example for such a document?

I will email you immediately with a copy.

>>> But it can be tricky to make it right. It heavily depends on the spell checker -
>>> aspell, e.g., accepts completely different "alternative" language settings than
>>> hunspell or Apple's spell checker do. And it depends on the runtime environment -
>>> what dictionaries are available for the user on the current machine.
>>> And we have the feature to switch between the spell checker back ends at runtime.
>>
>> This sounds ugly.  Is there any similarity between spell checking APIs?  Is
>> there a cross platform, spell checking library unification / abstraction layer
>> available? Would it be worth developing one? How difficult is it to detect
>> known dictionaries and spell checkers on a cross-platform basis?
>
> I'll cite my own investigation about similarity between spell checking APIs.
> The focus was the management of personal word lists.
>
>> We have support for different spell checker backends.
>> All of them are able to check words, of course.
>> But the capabilities with personal word lists differ horribly.
>> The following table presents the results of my investigation.
>>
>> Feature     | aspell | native (mac) | enchant | hunspell
>> ========================================================
>> check       | +      | +            | +       | +
>> suggest     | +      | +            | +       | +
>> accept      | +      | +            | +       | +
>> insert      | +      | +            | o (2)   | o (3)
>> ispersonal? | o (1)  | +            | -       | -
>> remove      | -      | +            | + (4)   | -
>>
>> Legend:
>> + feature is supported
>> - feature is not supported
>> o there are limitations:
>> 1) aspell has the interface to enumerate the personal word list.
>>   So it's possible to implement, I have a patch for LyX at hand.
>> 2) Versions below 1.6.0 truncate the personal word list on open -
>>   effectively no personal word list is available after restart.
>> 3) There is no persistent state for personal word lists.
>> 4) Enchant manages its own personal word lists.

Thanks for this very interesting review.

Points (2) and (3), plus the lack of a remove feature for some
engines, seem the sorts of things that upstream reports could fix.
In addition, (3) could be worked around by LyX fairly easily
(though such workaround code is always best avoided), and both (2)
and the missing remove functionality seem non-critical (merely annoying).
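
To make the common core concrete, here is a minimal sketch of the
check/suggest/accept/insert cycle from the table above, written against
enchant's documented C API. (I am using enchant 1.x names here;
enchant_dict_add_to_personal is, as I understand it, the older name for
what newer releases call enchant_dict_add, so treat the exact spelling
as version-dependent.)

  // Minimal sketch: the capability core shared by all backends in the
  // table above, via enchant's C API (enchant 1.x).
  #include <enchant.h>
  #include <cstring>
  #include <iostream>

  int main() {
      EnchantBroker *broker = enchant_broker_init();
      EnchantDict *dict = enchant_broker_request_dict(broker, "en_US");
      if (!dict) { std::cerr << "no en_US dictionary\n"; return 1; }

      const char *word = "romanised";
      // "check": returns 0 when the word is known.
      if (enchant_dict_check(dict, word, std::strlen(word)) != 0) {
          // "suggest": replacement candidates from the backend.
          size_t n = 0;
          char **suggs = enchant_dict_suggest(dict, word, std::strlen(word), &n);
          for (size_t i = 0; i < n; ++i)
              std::cout << "suggestion: " << suggs[i] << '\n';
          if (suggs) enchant_dict_free_string_list(dict, suggs);

          // "accept": valid for this session only.
          enchant_dict_add_to_session(dict, word, std::strlen(word));
          // "insert": add to the personal word list -- limitation (2)
          // above suggests this is only persistent from 1.6.0 onwards.
          enchant_dict_add_to_personal(dict, word, std::strlen(word));
      }

      enchant_broker_free_dict(broker, dict);
      enchant_broker_free(broker);
      return 0;
  }

(The point is simply that this one API reaches aspell, hunspell and
AppleSpell alike; everything below the "check" row of the table is where
the backends diverge.)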

Could you please clarify the purpose of 'ispersonal?' at the
LyX UI level?  I'm guessing it's a feature to test whether a word
came from a "standard" dictionary or from a "personal" dictionary,
but I am unsure whether this distinction is useful for LyX at all.

>> There is already some rumour on the net about consolidating spelling
>> for the whole desktop.
>> https://wiki.ubuntu.com/ConsolidateSpellingLibs
>> I don't know how long it would take to get some result.

This seems to be a move to push Ubuntu towards hunspell
and enchant.  hunspell is claimed to be the most modern
implementation of a multilingual spell checking backend.

However, its language support is not complete, so users of
some languages still require support for other, specialised
engines (Voikko for Finnish; Zemberek for Turkish; Uspell for
Yiddish, Hebrew, and Eastern European languages; Hspell
for Hebrew), which enchant can provide.

Contrary to your prior email, it seems that enchant does
work on OSX, since support for AppleSpell (Mac OSX) is
claimed at http://www.abisource.com/projects/enchant/

(Note: Ubuntu cites the lack of upstream hunspell or enchant
support in some major software packages, such as PHP, as a
problem preventing older implementations from being dropped.
GTKSpell and KDE already use enchant.)
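
Incidentally, this also bears on my earlier question about detecting
known dictionaries and spell checkers cross-platform: enchant can
enumerate whatever its providers (hunspell, aspell, Voikko, AppleSpell,
...) expose on the current machine. A short sketch, again against the
enchant 1.x C API as I understand it:

  #include <enchant.h>
  #include <iostream>

  // Callback invoked once per (language tag, provider) pair that
  // enchant can see on this machine, e.g. "fi -> voikko".
  static void describe(const char *lang_tag, const char *provider_name,
                       const char *provider_desc, const char *provider_file,
                       void *user_data) {
      (void)provider_desc; (void)provider_file; (void)user_data;
      std::cout << lang_tag << " -> " << provider_name << '\n';
  }

  int main() {
      EnchantBroker *broker = enchant_broker_init();
      enchant_broker_list_dicts(broker, describe, nullptr);
      // Point probe for a single language tag:
      std::cout << "fi available? "
                << (enchant_broker_dict_exists(broker, "fi") ? "yes" : "no")
                << '\n';
      enchant_broker_free(broker);
      return 0;
  }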

>>>> 3. Wider problem of spellchecking and multilingual support
>>>>   -------------------------------------------------------
>>>>   Regarding points 1 and 2, really there is a wider problem of multilingual
>>>>   support being a little 'all over the place', with a bunch of different
>>>>   "solutions" in use.  In terms of LyX, none of these are really "solutions"
>>>>   as even with LyX 2.0beta1 it appears to be demonstrably impossible to link
>>>>   the manual language markup made in conjunction with a font-linked solution
>>>>   to the manual language markup required for spellchecking purposes.
>>>
>>> Sorry, I cannot follow you. The language you assign to a word, phrase or paragraph
>>> is used for spell checking of the words in that area. Do you refer to the
>>> fact that it's possible to mark two parts of a word with two languages?
>>
>> Sorry for my lack of clarity.  Let me try again.
>>
>> Right now there are three concepts, none of them really linked by LyX
>> without customisation:
>> - text style
>> - language
>> - font
>>
>> Only partial solutions exist for relating these together.
>>
>> None of them seem to provide for particularly good user experience.
>>
>> IMHO the number of hacks developed in this area shows clear community
>> frustration with the status quo.
>>
>> In short, whilst LyX is a great tool, it is in areas like this that it
>> occurs to me that perhaps LyX can go so much further by tackling some of "the hard
>> problems" such as complex multilingual use cases.
>
> In recent years, I have read every now and then emails proposing the elimination
> of manual text markup - or at least hiding it better... but it didn't happen.
>
> People want to have the text styles to be able to "finger paint" the text.

I can see the requirement for this; in linguistics, for example, authors
quite frequently wish to show multiple forms of the same glyph in
different fonts.  That is a normal requirement (and leaving such
presentation choices to fonts is a critical design-level assumption of
unicode and other character sets, which deliberately do not address
that problem domain...)

An elegant solution for multilingual text markup would be one that is entirely
separate from 'text style' blocks.  Marking up language should be just
that, and spell checking and (La/Xe)TeX-layer (babel) hyphenation should
run exclusively against this information.

(Unfortunately, I have so far found no way to make this happen with
LyX-2.0.0beta3; please see the hacks and manual markup used in the
privately supplied sample document.)

>>>>   As per previous posts whereby I suggested revising the user interface to
>>>>   make proper use of available databases and let the user assign fonts
>>>>   to unicode blocks and/or languages and/or custom defined text-types for
>>>>   font selection purposes, a forward-looking, integrated solution should
>>>>   also take in to account spellchecker requirements.
>>>>
>>>>   Otherwise, we poor users are laboured with having to make 1000 manual
>>>>   markups just to include a short bit of text!
>
> This sounds a bit exaggerated...

Please see the document supplied. Really, it is tedious to mention a place
or person with romanised, native and translated forms when so many steps
are required to mark each portion up for spellchecking + hyphenation
(language) and correct font selection (text style, with the hack I am using,
which another user shared the previous time I posted on multilingual
issues, and which remains the only method I have found to achieve the
output I am looking for).

>>>>   This is exemplified if,
>>>>   for instance, one wishes to quote a place name with translations and their
>>>>   romanised equivalents in situ at many points throughout a document
>>>>   (my unfortunate situation, and before anyone asks: no I cannot switch to
>>>>   compiling a reference table, for reasons of readership and readability)
>
> You may copy the place names to many points in your document.

The personal and geographic names are numerous and differ throughout the
document.  Please see relevant portions of the sample document I have sent
privately.  Their number will grow significantly as the text develops.

>>>> 4. Weird behaviour with common prefixes and specialist compounds [X]
>>>>   -----------------------------------------------------------------
>>>>   Common prefixes such as micro and proto seem to confuse aspell.  Not sure
>>>>   if this is somehow related to how it is linked from LyX, but I assume the
>>>>   issue is with them.  For example, 'proto-<known word>' does not seem to
>>>>   be accepted, forcing 'proto' to be added manually as a valid word.
>>>>   Unfortunately, the LyX interface does not offer a proper workaround.
>>>>   (Please see point 5.)
>>>>   (Note: Upon further investigation, actually a lot of words appear to be
>>>>    missing from the default dictionary, including "hewn", "proven",
>>>>    "romanised". A scrabble player would be dismayed: for many points!)
>>>>   (PS: Did anyone ever wonder about the etymology of 'hardscrabble'? I think
>>>>    aspell's default English dictionary could be involved in at least one
>>>>    definition...)
>>>
>>> This one I have to investigate, cannot comment on this now.
>>>
>>> But, AFAIK there is no default aspell dictionary. It depends on the
>>> software packager what gets distributed. You may have an installation
>>> with a German dictionary only. And there are different English dictionaries
>>> available...
>>>
>>> This is, what my aspell installation has to offer for english:
>>> * en, en-w_accents, en-wo_accents
>>> * en-variant_0, en-variant_1, en-variant_2
>>> * en_CA, en_CA-w_accents, en_CA-wo_accents
>>> * en_GB, en_GB-w_accents, en_GB-wo_accents
>>> * en_GB-ise, en_GB-ise-w_accents, en_GB-ise-wo_accents
>>> * en_GB-ize, en_GB-ize-w_accents, en_GB-ize-wo_accents
>>> * en_US, en_US-w_accents, en_US-wo_accents
>>>
>>> Some of them are combined dictionaries.
>>
>> Perhaps a summary could be made available of dictionary contents,
>> either through built-in descriptions and/or the proposed pan-spellchecker-
>> engine abstraction/unification library, hmmm?
>
> Perhaps. Currently I don't know of a usable pan-spellchecker-engine.

If the resources and experience of programmers on much larger projects
are anything to judge by, then a/ispell seems to be pretty much obsolete
and in the process of becoming a historical footnote, with hunspell
greatly promoted in its stead.

Your previous email cites missing OSX support (apparently since remedied)
and the lack of "language variety support" as reasons not to use enchant.

Could you please explain the second point more clearly?

(If there is really a problem, it seems to me that in seeking to become a
dominant, unified engine, the enchant project developers would be very
keen to hear of any needs that are not met by their API/ABI...)

>>> Another option is to switch to hunspell and use the openoffice dictionaries.
>>> It is said that these dictionaries are superior.
>>
>> Thanks, I will try it. An excellent tip!
>> (Of course it would be better if the UI suggested this or even
>> detected availability...)

After reading up further, I can see the problem with this idea (missing
language support offered by enchant) and instead would like to first
explore any limitations of LyX's apparent enchant support.

> If you cannot see it in your UI you cannot use it. Then it's not available.
> BTW, what version of LyX are you using?

LyX-2.0.0beta3, recently upgraded from LyX-2.0.0beta1 on Gentoo Linux.

>>>> 5. Right sidebar spellchecker interface: word addition [*]
>>>>   -------------------------------------------------------
>>>>   At various points throughout my document I use accepted phrases within
>>>>   the sphere of my writing such as "Proto-Austro-Tai" and "Tai-Kadai".
>>>>
>>>>   Whilst "Tai" and "Kadai" are also used as individual words, "Proto" and
>>>>   "Austro" are not.  With the present spellchecker interface, when such
>>>>   'word portions' occur, I am only given two options:
>>>>
>>>>    1. Adding these 'word portions' as words in their own right
>>>>    2. Ignoring them as words in their own right
>>>>
>>>>   Both options are less than ideal because they will subsequently allow
>>>>   the individual words to occur alone, ie: such that human input could
>>>>   conceivably render "Come hither, pronto!" as "Come hither, proto!" and
>>>>   the spellchecker would consider this to be correct, despite the fact
>>>>   that proto should possibly not occur as a word in its own right.
>>>>   (OK well that's probably arguable, but you still see the point!)
>>>>
>>>>   The best option for resolving this would be to modify the LyX spellchecker
>>>>   sidebar interface to allow adding arbitrary words or entire words rather
>>>>   than simply word portions thereof that have been identified by aspell as
>>>>   unknown.  (ie: When presented with "Proto-Austro-Tai", and "Proto" is
>>>>   highlighted, then the user should be able to add "Proto-Austro-Tai" as
>>>>   a word in its own right rather than only the 'word portion' "Proto" itself.)
>>>>   (If I recall, 'other' word processing solutions include this feature.)
>>>
>>> Here LyX relies on the spell checker interface. Most checkers are able to do
>>> the checks at word level only. Consequently you cannot add compound words to
>>> your personal dictionary, AFAIK. Here I want to wait for an improvement of the
>>> spell checker libraries. It's possible to check complete sentences -
>>> the Apple spell checker has this capability. It's even able to auto-detect
>>> the language...
>>
>> (Well "they" do say that Apple is very good at usability, and that open source
>> generally isn't.  Perhaps "they" are correct in this assertion, sometimes...)
>
> Apple is using the hunspell spell checker engine internally for the spell service
> of the OS (since Snow Leopard). But obviously they added some code to make the
> grammar checking and multi-language detection possible.

Ooh, interesting!

Which libraries support grammar checking? (It was not mentioned in your
research summary table.)  A grammar checking feature seems a 'bonus'
rather than a requirement, though, at least to me; spell checking seems
much more important.  (In my experience, grammar checking in 'other'
word processing solutions has tended to be a pain rather than a
pleasure, at least in one's native language.)

On the second point of multi-language detection...

> The spell checker on mac
> does what you want, AFAIK. It uses heuristics to choose the language if you don't
> provide the information. But then - I have seen it in Apples Mail tool - a spanish
> word in a french sentence will be marked as misspelled. Perhaps you would be
> disappointed again by the fact being forced to mark that spanish word manually.

For languages with a common character set and similar vocabularies, this is
always going to be a requirement.  Certainly, the detection of language is
often possible at the phrase level; at the level of a single word, however,
it is sometimes impossible.

Notwithstanding the above, when preparing documents such as mine that
feature many short phrases in languages that are unicode-block-distinct
(Chinese, Thai, Lao, Arabic, Mongolian, Greek, Yi, etc.), it seems a
ridiculous UI proposition not to support some form of automated language
markup at the time of input, since the vast majority of language changes
in a document are quite obvious and continue for more than the length of
a single word.

> I've heard they returned some contribution to the community (OpenSpell)...

I could not find any information on this, only:
http://openxspell.sourceforge.net/

In any case, the real question seems to be: "What is the best way to
implement language detection, for the purposes of preventing the need to
finger paint language markup and thus enhance ease of use where multiple
languages are frequently mixed within a single document?"

(For the purposes of this discussion, let's define 'best' as "relatively easy
to implement" and "forward looking" (ie: not going to cause issues) from
a developer perspective, and "relieves as wide a user base as possible
of tedium" from a user perspective.)

Leaving aside for a moment the other annoyances such as font, TeX
backend and spell-checking backend selection in such multilingual
documents, and working on the dual assumptions that "it is
necessary at a first stage to mark up the language of a piece of
text", and "tedium in this process is to be avoided", then it seems
to me that there are two approaches available:
 (1) Dictionary-based language detection
 (2) Unicode code block change based language detection

Both methods are imperfect and will still require some finger-painting;
however, any automation will be of great use to the user.

Regarding method (1), which Apple has apparently used throughout OSX
(Google also provides a "detect language" feature on its Google
Translate service), there are some inherent limitations:
 (1A) [Missing dictionary]: A dictionary must be available in the
language in question.
 (1B) [Ambiguity]: Certain short phrases may exist in multiple
languages and thus prevent reliable detection.
 (1C) [Many languages]: CPU overhead may be high when many languages
are in concurrent use and dictionary-based detection, ie: method (1),
is the sole and primary means of language detection.
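
To illustrate method (1) and where limitations (1A), (1B) and (1C) bite,
here is a naive scoring sketch (illustrative only, not LyX code): score a
phrase against each candidate dictionary by the fraction of its words
that check cleanly, and give up on a tie.

  #include <enchant.h>
  #include <sstream>
  #include <string>
  #include <vector>

  // Naive method (1): return the candidate language whose dictionary
  // knows the largest fraction of the phrase's words, or "" if ambiguous.
  std::string guessLanguage(EnchantBroker *broker,
                            const std::vector<std::string> &candidates,
                            const std::string &phrase) {
      std::string best;
      double bestScore = 0.0;
      for (const std::string &tag : candidates) {
          EnchantDict *dict = enchant_broker_request_dict(broker, tag.c_str());
          if (!dict)
              continue;                     // limitation (1A): no dictionary
          std::istringstream words(phrase);
          std::string w;
          int total = 0, known = 0;
          while (words >> w) {
              ++total;
              if (enchant_dict_check(dict, w.c_str(), w.size()) == 0)
                  ++known;
          }
          double score = total ? double(known) / total : 0.0;
          if (score > bestScore) {
              bestScore = score;
              best = tag;
          } else if (score == bestScore) {
              best.clear();                 // limitation (1B): ambiguous
          }
          enchant_broker_free_dict(broker, dict);
      }
      return best;                          // limitation (1C): cost grows
  }                                         // with the candidate count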

In short, method (1) appears best suited to environments in which
monolingual text is being prepared in a popular language for which
dictionaries exist, and where the text's length exceeds a single word or
phrase.  As a large part of LyX's community seems to be academic or
technical, its members are less likely than a "general user base" to be
limited to the production of such documents.  In particular:
 - The use of pre-modern, extinct or obscure languages or historical
language forms will ensure that dictionaries are missing relatively
frequently.
 - It is highly likely that multiple languages are present in a single document.
 - It is relatively likely that many languages are present in a single document.

Let's now look quickly at method (2), unicode code-block based
detection.  This is the method that I originally proposed to the list
at http://www.mail-archive.com/lyx-users@lists.lyx.org/msg83635.html
(attachment: http://pratyeka.org/unicode-font-mockup.png), which you
said you personally lacked the time/motivation to read, but which I
feel I must reference again.

Under this method, no dictionary-based detection is required.  The
user is given the ability to auto-associate text input under a certain
unicode codeblock with a certain language, and input is automatically
marked up.  Whilst multiple languages per codeblock are supported,
they would require either finger-painting or a fallback to method (1)
(see below for notes on combined methods).
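
A sketch of the data structure involved follows. The ranges are real
unicode block boundaries; the language assignments are illustrative user
configuration of the kind the mockup proposes, not anything LyX provides
today.

  #include <string>
  #include <vector>

  // Method (2): map a code point's unicode block to the user's
  // configured "primary language" for that block.
  struct BlockLang {
      char32_t first, last;   // inclusive code point range of the block
      std::string lang;       // user-configured primary language
  };

  // Illustrative per-document configuration.
  static const std::vector<BlockLang> kBlockMap = {
      {0x0370, 0x03FF, "el"},      // Greek and Coptic
      {0x0600, 0x06FF, "ar"},      // Arabic
      {0x0E00, 0x0E7F, "th"},      // Thai
      {0x0E80, 0x0EFF, "lo"},      // Lao
      {0x4E00, 0x9FFF, "zh_CN"},   // CJK Unified Ideographs
  };

  // Returns "" when the block is unconfigured or shared by several
  // languages - i.e. fall back to finger-painting or to method (1).
  std::string languageForCodepoint(char32_t cp) {
      for (const BlockLang &b : kBlockMap)
          if (cp >= b.first && cp <= b.last)
              return b.lang;
      return "";
  }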

Having reviewed the two methods briefly, there is clearly also
potential for a combination of methods (1) and (2) to resolve
implementation challenges (in particular, the inefficiency of
many-language deployments of method (1)). A simple combined technique
would be to use (2) to limit the searched dictionaries to languages
known to occur in the relevant unicode codeblock, then use method (1)
to perform the final language detection.
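
Composing the two sketches above makes the combined technique almost
trivial; the per-block candidate lists are, again, hypothetical user
configuration rather than anything that exists today.

  #include <enchant.h>
  #include <map>
  #include <string>
  #include <vector>

  // Sketched above: method (1) scoring and method (2) block lookup.
  std::string guessLanguage(EnchantBroker *broker,
                            const std::vector<std::string> &candidates,
                            const std::string &phrase);
  std::string languageForCodepoint(char32_t cp);

  // Combined technique: method (2) cheaply narrows the candidate set
  // from the first code point, then method (1) decides among survivors.
  std::string guessCombined(EnchantBroker *broker, char32_t firstCp,
                            const std::string &utf8Phrase) {
      // Hypothetical per-block configuration: which languages the user
      // expects to appear in each block of this document.
      static const std::map<std::string, std::vector<std::string>> perBlock = {
          {"zh_CN", {"zh_CN", "zh_TW", "ja"}},  // Han is shared by languages
          {"",      {"en_GB", "fr", "es"}},     // Latin / unconfigured
      };
      const std::string primary = languageForCodepoint(firstCp);
      auto it = perBlock.find(primary);
      if (it == perBlock.end())
          return primary;  // block maps to exactly one configured language
      return guessLanguage(broker, it->second, utf8Phrase);
  }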

Other techniques such as statistical analysis are also possible,
though far beyond the scope of LyX. Even a database matching
dictionary languages to unicode blocks, however, almost certainly does
not belong in LyX. (Bloat, "do one thing and do it well", KISS, etc.)

Thus in my original email (should you be able to find the time to
read it!) I suggested a prescriptive UI whereby the user is able to
define their own expectations of which languages exist per unicode
block within their document.  This also provides the side benefit of
efficiency, since the 'primary language' (the default language to
which input in that unicode block automatically sets language markup)
could also be the 'primary dictionary' for language detection (eg: when
pasting large blocks of text, opening legacy files, falling back to
(1) from (2) as described above, or whatever...).

Finally, some left-field techniques may also be available, like making
API calls out to Google Translate for 'outsourced' language detection!

>>>> 6. Dictionary Re-Use Support [*]
>>>>   -----------------------------
>>>>   Another point is that of re-use.  Which is to say that, when someone uses
>>>>   for example 'BibTeX' to compile a bibliographic database, that database
>>>>   may easily be used with other projects and is considered portable.  So
>>>>   for all physics papers I can use one bibliography, and I may have another
>>>>   for history papers.  Whilst this is presently handled adequately by LyX,
>>>>   the equivalent functionality is not present for dictionary databases.
>>>>   It should be.  This means both adding a 'manage multiple dictionaries in
>>>>   this project' feature-set, and adding a 'which dictionary do you want to
>>>>   add the word to' drop-down in the right hand spellchecker sidebar.
>>>
>>> This is a good idea (already mentioned on developers list, AFAICR).
>>> The idea is to incorporate a personal dictionary into the document.
>>> But it definitively will not happen tomorrow.
>>
>> Great, as long as the personal dictionary "in" the document is saved "outside"
>> the document and as a file that can be:
>> a) shared between multiple documents
>> b) used with zero or more additional personal dictionaries within a single file
>> c) identified as the target dictionary (vs. other active personal
>> dictionaries) when spell-checking the document and adding words to
>> personal dictionaries
>
> The externally saved personal dictionary is shared between multiple documents per se.

I am unclear on the suggested implementation; however, any assumption
that all documents a user is editing should use the same personal
dictionary is flawed.
From your next comment, I think you see this...

> My idea was to provide the dictionary "inside" the document as an alternative.
> If you are sure a word is correctly spelled, you want to transfer this know-how
> with the document.

This is a good idea; however, it is not nearly as flexible as the established
and proven BibTeX approach, ie: the storage of dictionaries in external
dictionary files that can be linked to documents on a many-to-many basis.
This approach gives the user control ("my physics dictionary", "my
history dictionary", "my computer science dictionary") without making
unnecessary assumptions about the similarity of ALL documents that a
user edits or about the number of domains in which a user actively writes.

> To have multiple private dictionaries is a third option with - like the second one -
> a much higher demand on the usability of the spell checker interface.

This is my suggestion and the only way I can see LyX providing a
competent and future-proof solution for spell checking.

Did your research reveal which interface(s) did or did not support
this approach?

This seems a legitimate need that should be pushed upstream if it is
unavailable in hunspell and enchant.  enchant seems the sort of project
where implementing a costly emulation layer for backends missing this
support might happen "for free" (at least in terms of the LyX project's
developer time) if a request is acknowledged.  The motivation for
enchant's developers is that this would further differentiate enchant's
API/ABI from hunspell's in the eyes of application developers, since
hunspell seems to be rapidly subsuming prior systems and the value of
the enchant layer is therefore decreasing for some development scenarios.

In addition, enchant probably already has code to read/write external
dictionary files of the sort suggested ("my <x-domain> dictionary"...) and
could expose an API/ABI for doing so that LyX could easily utilise. Any
functionality identified as missing may be enthusiastically implemented
by the enchant developers.
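
In fact, unless I am misreading the headers, enchant 1.x already appears
to expose file-backed personal word lists via
enchant_broker_request_pwl_dict() (plain text files, one word per line),
which looks like exactly the building block needed. A sketch, with
illustrative file paths:

  #include <enchant.h>
  #include <cstring>

  int main() {
      EnchantBroker *broker = enchant_broker_init();

      // Each PWL is a plain text word list, shareable between documents
      // just like a .bib file.
      EnchantDict *physics = enchant_broker_request_pwl_dict(
          broker, "/home/walter/dictionaries/physics.pwl");
      EnchantDict *history = enchant_broker_request_pwl_dict(
          broker, "/home/walter/dictionaries/history.pwl");

      // "Which dictionary do you want to add the word to?" becomes a
      // choice of which PWL handle receives the word.
      const char *word = "Proto-Austro-Tai";
      enchant_dict_add_to_personal(history, word, std::strlen(word));

      // A word could be accepted if any dictionary linked to the
      // document knows it.
      bool ok = enchant_dict_check(physics, word, std::strlen(word)) == 0
             || enchant_dict_check(history, word, std::strlen(word)) == 0;
      (void)ok;

      enchant_broker_free_dict(broker, physics);
      enchant_broker_free_dict(broker, history);
      enchant_broker_free(broker);
      return 0;
  }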

Hell, they might even extend their scope slightly to throw in some
critical functions for language detection!  (Indeed, on the face of it,
there would seem to be little point in adding any such code to LyX
rather than enchant, a cross-platform library that already provides
concurrent access to multiple spellchecking engines' dictionaries.  Do
one thing and do it well, the KISS principle?)

Sorry I do not have time to go back and snip all of this; I have been
typing and researching the above since before dawn (sunrise over Los
Angeles is surprisingly beautiful!) and am now running late for work!

Very keen to hear thoughts...

- Walter