design -- multiple languages in the same document

Paul Rohr (paul@abisource.com)
Tue, 09 Nov 1999 11:57:01 -0800


At 03:48 PM 11/9/99 -0000, Caolan McNamara wrote:
>It has come to my attention that there are many word documents that
>use 8 bit characters. And use the windows codepages with them. To
>support this kind of stuff I intend to add another variable to
>the charhandler function named "lid". The lid will be ms word's
>language identifier of the character involved.
>
>At all times you can use this to determine what language the text
>is in. This is a bit of an idea for abiword for spell checkers, to have
>part of the document in "english" and the rest in "german" and have
>different spell checkers for each region.

Does the Word file format always indicate the content language, or only when
the charset varies? (I thought I remembered a finer-grained UI for
indicating specific languages within a charset, but it's been a while.) In
any event, the charset alone sounds like a good cue, at least to get started
with.

I should point out that this feature has definitely *not* shown up on
anyone's 1.0 feature list, but I'm sure it will eventually be coded.

If someone's interested in adding more sophisticated support for content in
multiple languages, be aware that the feature involves work on a number of
levels.

file format
===========
To begin with, our file format will need a mechanism like HTML's for
indicating the "lang" of a given span.

http://www.w3.org/TR/REC-html40/struct/dirlang.html#h-8.1

CSS uses a pseudo-class for this purpose, presumably because LANG is a more
permanent structural attribute of the content, rather than a format that
could change:

http://www.w3.org/TR/REC-CSS2/selector.html#lang

Our usual approach would be to add a character-level property for this, but
I suspect we should use an attribute instead. Mechanically, this is pretty
easy, but you'll have to check that the inheritance works as expected.

(This also implies that you can't set the LANG using styles, but that's
probably a Good Thing.)
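
To make the inheritance check concrete, here's a rough sketch in Python
(not AbiWord code) of how an effective LANG could be resolved. The Node
class, the effective_lang helper, the "lang" attribute name, and the
span/paragraph/document nesting are all assumptions for illustration; none
of them has been decided.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """Hypothetical document node: a span, a paragraph, or the document."""
    attrs: Dict[str, str] = field(default_factory=dict)
    parent: Optional["Node"] = None


def effective_lang(node: Optional[Node], fallback: str = "en-us") -> str:
    """Walk up the tree and return the nearest explicitly set lang."""
    while node is not None:
        lang = node.attrs.get("lang")
        if lang:
            return lang
        node = node.parent
    return fallback


doc = Node(attrs={"lang": "en-us"})                     # document default
para = Node(parent=doc)                                 # nothing set here
german_span = Node(attrs={"lang": "de"}, parent=para)   # explicit override
plain_span = Node(parent=para)                          # inherits from doc
```

Here german_span resolves to its own value, while plain_span falls through
the paragraph to the document default.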

importers
=========
Once we decide what attribute/property to use, each importer will need to
figure out when and how to set it properly. Caolan's indicated how wv can
do this, and I'm confident that RTF, etc. will have a similar mechanism of
some sort.
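
For the Word case, the importer's job might boil down to a lookup like the
one sketched below (Python, for illustration only). The numeric values are
standard Windows language identifiers; the lang_from_lid helper, the choice
of lowercase tags, and the idea of returning None for unknown values are
assumptions, not anything wv provides.

```python
from typing import Optional

# A few well-known Windows language identifiers. The table is deliberately
# tiny; a real importer would need a much fuller one.
LID_TO_LANG = {
    0x0407: "de-de",  # German (Germany)
    0x0409: "en-us",  # English (United States)
    0x040C: "fr-fr",  # French (France)
    0x0809: "en-gb",  # English (United Kingdom)
    0x0816: "pt-pt",  # Portuguese (Portugal)
}


def lang_from_lid(lid: int) -> Optional[str]:
    """Return the canonical tag for a Word lid, or None if it is unknown."""
    return LID_TO_LANG.get(lid)
```

Returning None for an unknown lid lets the importer leave the attribute
unset, so the span simply inherits the document default.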

UI stuff -- allow the user to change the LANG of a portion of the document
==========================================================================
Conceptually, this should work like any of our existing formatting
operations. Drag out a selection and then "apply" a specific language to
it. The trick is to come up with an efficient UI for selecting which
language/locale to apply. This could be a dialog, I suppose, or a
pull-right menu. IIRC, Word has a drop-down combo button or something
listing the flags(?) of frequently-used languages.

UI stuff -- set the default LANG for a document
===============================================
By comparison, this one's easy. It's definitely a dialog, or a top-level
pull-right menu.

general -- identify languages by name
=====================================
The prefs mechanism allows you to change the UI language, and now we're
talking about allowing the content language to change as well. As these
features make their way into the UI (instead of being hidden behind the
scenes), we'll need to start using a systematic way of tagging and
presenting language-specific resources (such as UI translations,
dictionaries, help systems, etc.).

Conceptually, this is pretty simple. Each such resource needs two
additional pieces of metadata associated with it:

- a canonical name the program will recognize [en-us, de]
- a localized, human-readable name [English (United States), Deutsch]

To minimize the n-way translation headache, note that the latter name only
needs to make sense to speakers of that language. If you can't read the
label, the labelled resource won't do you much good anyway.

In other words, the following list of available languages would always be
labelled properly, no matter what language your UI is currently set to:

Deutsch
English (United States)
Français

However, the prompt introducing the list would change. :-)
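
As a rough illustration (Python, with made-up names), the two pieces of
metadata could travel together like this. LanguageResource, AVAILABLE, and
menu_labels are hypothetical; the point is only that the labels never
depend on the current UI language.

```python
from typing import Iterable, List, NamedTuple


class LanguageResource(NamedTuple):
    tag: str          # canonical name the program matches on
    native_name: str  # label written in the language itself


AVAILABLE = [
    LanguageResource("en-us", "English (United States)"),
    LanguageResource("fr", "Français"),
    LanguageResource("de", "Deutsch"),
]


def menu_labels(resources: Iterable[LanguageResource]) -> List[str]:
    """Return the native names, sorted by plain code point order.

    The result is the same whatever the UI language is set to. Proper
    locale-aware sorting is left out of this sketch.
    """
    return sorted(r.native_name for r in resources)
```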

spelling -- identify dictionary languages
=========================================
Currently, we blindly load an ispell dictionary called american.hash (or
portuges.hash for that matter) and just use it. If we're going to allow
folks to use multiple such dictionaries, we really ought to make sure that
each has those two additional pieces of metadata associated with it.

Likewise, we should change the spelling dialog title to reflect the language
currently being checked.

spelling -- dynamically load dictionaries
=========================================
Currently, we only load one dictionary at a time, which is a good thing,
since they're *big*. More work will need to be done so that we can
demand-load dictionaries as needed. Ditto for unloading them when memory
gets tight.
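
One possible shape for this is a small least-recently-used cache, sketched
here in Python. DictionaryCache, the loader callback, and the max_loaded
limit are all assumptions; a real version would load ispell hash files and
would probably watch actual memory use instead of a fixed count.

```python
from collections import OrderedDict
from typing import Any, Callable, List


class DictionaryCache:
    """Load dictionaries on demand; drop the least recently used one."""

    def __init__(self, loader: Callable[[str], Any], max_loaded: int = 2):
        self._loader = loader          # hypothetical: maps a lang tag to a dictionary
        self._max_loaded = max_loaded
        self._loaded: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, lang: str) -> Any:
        if lang in self._loaded:
            self._loaded.move_to_end(lang)      # mark as most recently used
            return self._loaded[lang]
        dictionary = self._loader(lang)         # demand-load on first use
        self._loaded[lang] = dictionary
        while len(self._loaded) > self._max_loaded:
            self._loaded.popitem(last=False)    # unload the oldest entry
        return dictionary

    def loaded_langs(self) -> List[str]:
        return list(self._loaded)


# Stand-in loader that records each (expensive) load.
load_log: List[str] = []


def fake_loader(lang: str) -> dict:
    load_log.append(lang)
    return {"lang": lang}


cache = DictionaryCache(fake_loader, max_loaded=2)
for lang in ["en-us", "de", "en-us", "fr", "de"]:
    cache.get(lang)
```

With room for two dictionaries, the second "en-us" lookup is served from
memory, while "de" has to be reloaded after "fr" pushes it out.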

spelling -- switch dictionaries based on LANG
=============================================
Changing the spell check algorithms to honor the LANG attribute shouldn't be
that hard. We might want to do a bit of extra work to avoid nasty
performance hits, though, since LANGs should change rarely, if at all, in a
given document.
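
One cheap way to exploit that, sketched in Python under the assumption that
the checker sees text as a sequence of (lang, text) spans: merge adjacent
spans with the same LANG first, so the dictionary is picked once per run
and not once per word. The lang_runs helper and the span representation are
hypothetical.

```python
from itertools import groupby
from typing import Iterable, List, Tuple

Span = Tuple[str, str]  # (lang, text)


def lang_runs(spans: Iterable[Span]) -> List[Span]:
    """Merge adjacent spans that share a lang into a single run."""
    return [
        (lang, "".join(text for _, text in group))
        for lang, group in groupby(spans, key=lambda span: span[0])
    ]


spans = [
    ("en-us", "The word "),
    ("en-us", "for dog is "),
    ("de", "Hund"),
    ("en-us", "."),
]
runs = lang_runs(spans)
```

In the common case of a single-language document, this collapses to one
run and the checker never has to switch dictionaries at all.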

bottom line
===========
My goal isn't to scare anyone off. This is a very cool feature that we
definitely want to include in the product. Someday. :-)

Like any fundamental feature, it touches bits and pieces of code throughout
the source tree. Even so, it'll be a lot easier than other, more localized,
features, such as tables or change bars. :-)

Caveat hacker.

Paul


