Re: a thought -- fonts get us a long, long way

From: Andrew Dunbar (hippietrail@yahoo.com)
Date: Tue Apr 30 2002 - 00:47:41 EDT


     --- Paul Rohr <paul@abisource.com> wrote:
    > One offshoot of the whole i18n/Pango discussion
    > recently is that it finally
    > dawned on me just how powerful our *existing*
    > Unicode support in 1.0 already
    > is -- without BiDi or Pango.
    >
    > Provided that users can locate appropriate fonts,
    > that is.
    >
    > It might be helpful to segregate the languages we
    > support into the following
    > broad categories:
    >
    > 1. easy
    > 2. easy, with the right font
    > 3. bidi
    > 4. complex shaping required (including combining
    > characters)
    >
    > As the World.abw test document demonstrates, there
    > are a *lot* of languages
    > which fall into the first two categories.
    >
    > the "just fonts" languages
    > --------------------------
    > Not only are there thirty-some Latin-1 languages
    > which definitely fall into
    > the first category (most fonts support them), but
    > some of the small,
    > general-purpose Unicode fonts being deployed add
    > "just enough" glyphs to
    > support an even broader range of languages.
    >
    > http://www.abisource.com/mailinglists/abiword-dev/02/Apr/1036.html
    >
    > Indeed, after doing some more digging, we can
    > support content in many more
    > languages by just locating a font that includes
    > enough glyphs in the
    > appropriate Unicode range.
    >
    > http://www.alanwood.net/unicode/fonts.html
    >
    > For example, the government of Nunavut has recently
    > created Unicode fonts
    > for Inuktitut:
    >
    > http://www.assembly.nu.ca/unicode/fonts/
    >
    > http://www.assembly.nu.ca/unicode/fonts/beginner.html
    >
    > I can't read them, of course, but they sure look
    > pretty. :-)

    The "just fonts" languages include:
    Latin-based languages:
    Czech, Polish, Hungarian, Esperanto, Maltese, ...
    Cyrillic-based languages:
    Russian, Ukrainian, Serbian, ...
    Probably the CJK (ideographic) languages:
    Chinese (traditional & simplified), Japanese, Korean.
    Greek.
    Georgian.
    Armenian.
    Ethiopian (Amharic).

    > the "harder" languages
    > ----------------------
    > Of course, there *are* languages for which we'll
    > need more than just fonts.
    > For example, Tomas has hand-coded a lot of support
    > for bidi languages, a
    > category which includes:
    >
    > ar, fa, he, ur, yi

    Those are the world's major RTL languages. The next
    most important one is probably Syriac.

    > Now we're investigating Pango since, in addition to
    > BiDi support, it should
    > (eventually) encapsulate knowledge about the more
    > complex typographic needs
    > of languages which don't have discrete Unicode
    > codepoints for all of the
    > glyphs needed. Andrew keeps mentioning Vietnamese
    > (vi-VN), and I know that
    > other South Asian languages need this, but how
    > extensive is the rest of this
    > category?

    Vietnamese has all the codepoints it needs, but not
    many fonts have glyphs for the precomposed ones. I
    believe Unicode now discourages adding any more
    precomposed characters, and Vietnamese was about the
    last language for which it did so.
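
    To make the precomposed-versus-combining point
    concrete, here is a rough Python sketch using only the
    standard unicodedata module. The variable names and
    the describe() helper are just for illustration. It
    takes one precomposed Vietnamese letter, splits it
    into a base letter plus combining marks, and puts it
    back together again.

```python
import unicodedata

# U+1EC7, the precomposed letter found in "Việt":
# LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW
precomposed = "\u1ec7"

# NFD splits a precomposed letter into a base letter plus combining marks.
decomposed = unicodedata.normalize("NFD", precomposed)

# NFC recombines the sequence into the precomposed form where one exists.
recomposed = unicodedata.normalize("NFC", decomposed)


def describe(text):
    """Return the Unicode character name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]


if __name__ == "__main__":
    print(describe(precomposed))
    print(describe(decomposed))
    print(recomposed == precomposed)
```

    A font that lacks the precomposed glyph can only show
    this letter if the renderer can stack the combining
    marks itself, which is why Vietnamese sits on the
    border of the "just fonts" category.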

    Languages which depend on combining characters are
    Thai and Lao. Thai has a long history of working on
    computers, and Lao is so similar in structure that
    supporting it is close to swapping in a different
    font. Basically if you can support one you can
    support both.

    The final category is "Complex Scripts" where the
    shaper has to do a lot of work. No attempt has been
    made to create all the hundreds of codepoints that
    would be needed for a complete set of precomposed
    characters.

    These include all the Indic scripts:
    Devanagari: Hindi, Sanskrit, Marathi, Nepali.
    Bengali: Bengali, Assamese.
    Tamil.
    Telugu.
    Kannada.
    Malayalam.
    Gurmukhi.
    Gujarati.
    Oriya.

    The Unicode encodings of all these scripts are based
    on ISCII, so when you can support one you can pretty
    much support all.
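
    One way to see the shared ISCII layout: each of these
    scripts has its own 128-codepoint block in Unicode,
    and corresponding letters sit at the same offset in
    each block. The Python sketch below is only an
    illustration (the BLOCK_START table and shift_script()
    are names I made up). It is a naive offset shift, not
    a real transliterator, and not every offset is
    assigned in every block (Tamil in particular has
    gaps).

```python
import unicodedata

# Start of each ISCII-derived block in Unicode; every block spans 0x80 code points.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}


def shift_script(text, source, target):
    """Move every character of the source block to the same offset in target.

    Characters outside the source block are left untouched. The result may
    contain unassigned code points where the target block has gaps.
    """
    lo = BLOCK_START[source]
    delta = BLOCK_START[target] - lo
    return "".join(
        chr(ord(ch) + delta) if lo <= ord(ch) < lo + 0x80 else ch
        for ch in text
    )


if __name__ == "__main__":
    ka = "\u0915"  # DEVANAGARI LETTER KA
    for script in ("bengali", "tamil", "telugu"):
        print(unicodedata.name(shift_script(ka, "devanagari", script)))
```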

    Then there are other Indic-derived scripts that were
    not part of ISCII and whose encodings were developed
    by Unicode. They usually have a few extra issues which
    didn't quite fit into the ISCII pattern:

    Khmer (Cambodian).
    Myanmar (Burmese).
    Sinhala.
    Tibetan.

    Finally there is one very quirky script that we
    probably should pretend we don't know about at this
    point:

    Mongolian. It's a vertical cursive script which
    looks a lot like Arabic but is written from top to
    bottom on the page in columns progressing from left
    to right! Pango is planning to support it though...

    > the question
    > ------------
    > OK, i18n experts ... is this a useful, clean
    > distinction? If not, please
    > let me know what I've garbled here.

    The distinction between "easy" and "just fonts"
    pretty much disappears when we're using mostly
    TrueType fonts encoded in more than 8 bits. Most
    Windows fonts cover a lot more than just Western
    Europe. This is really an 8-bit X font issue.

    Vietnamese is easier than Thai et al., which is easier
    than Devanagari et al., which may or may not be easier
    than Khmer et al.

    > bottom line
    > -----------
    > I'm thrilled that we've got dedicated folks working
    > on solving the "harder"
    > language problems. However, I'd love to see some
    > folks do more research on
    > improving our support for "just fonts" languages as
    > follows:
    >
    > - come up with a complete list of such languages
    > - come up with a list of the fonts needed to
    > support each of them

    This means languages which do not require "zero width"
    characters, "combining characters", "visual order
    different to logical order", positional variation,
    ligatures. It also pretty much means those which
    don't literally require combining characters but in
    reality do - like Vietnamese. Languages with very
    large sets of characters or very strange looking
    alphabets are still usually "just fonts" languages.
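
    Some of these criteria can be roughly checked from
    the Unicode character properties alone. The Python
    sketch below is my own illustration (the function
    name and the reason labels are made up) and only
    looks for three things in a text sample: combining
    marks, right-to-left characters, and invisible format
    characters such as the zero width joiner. It cannot
    detect positional variation or ligatures, so an empty
    result is a hint, not a guarantee.

```python
import unicodedata


def why_not_just_fonts(text):
    """Return a set of reasons why text may need more than a suitable font."""
    reasons = set()
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith("M"):
            # Mn, Mc, Me: combining marks, including Indic vowel signs
            reasons.add("marks")
        if category == "Cf":
            # invisible format characters, e.g. zero width joiner
            reasons.add("format")
        if unicodedata.bidirectional(ch) in ("R", "AL"):
            # strong right-to-left characters (Hebrew, Arabic, ...)
            reasons.add("bidi")
    return reasons


if __name__ == "__main__":
    samples = {
        "Indonesian": "Bahasa Indonesia",
        "Russian": "\u0420\u0443\u0441\u0441\u043a\u0438\u0439",
        "Hebrew": "\u05e9\u05dc\u05d5\u05dd",
        "Hindi": "\u0939\u093f\u0928\u094d\u0926\u0940",
    }
    for language, sample in samples.items():
        print(language, sorted(why_not_just_fonts(sample)))
```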

    Most of the information about the languages can be
    found on the Unicode site, though maybe not the
    information about fonts. Multilingual font listings
    are maintained by a few people around the web. I can
    recommend two free fonts which cover a huge amount of
    Unicode but are not very high quality (yet): Code2000
    and Code2001.
    > Note that this is essentially a web research task,
    > not a coding task. The
    > ultimate goal would be to learn enough so that we
    > could write a quick
    > website entry for each language, telling users:
    >
    > - who's responsible for the translation
    > - where to find dictionaries (if any)
    > - where to find fonts
    > - etc.
    >
    > For example, two sample entries might be
    >
    > Indonesian (id-ID)
    > ------------------
    > translators: Tim Allen, ...
    > dictionary: (n/a)
    > fonts: ...
    > sample: (the UTF-8 gobbledygook from
    > World.abw)
    > picture: (screenshot of the same)

    Indonesian uses the English alphabet with no accents
    so any English font will work.

    > Inuktitut (iu-CA)
    > -----------------
    > translators: (n/a)
    > dictionary: (n/a)
    > fonts:
    > http://www.assembly.nu.ca/unicode/fonts/
    > sample: (the UTF-8 gobbledygook from
    > World.abw)
    > picture: (screenshot of the same)
    >
    > Best of all, this could increase our language
    > support for the 1.0.* series
    > of products, while waiting for all the hard coding
    > work to get done for the
    > set of other languages which actually *do* need BiDi
    > and/or Pango.

    I'm really keen on us doing some "language
    evangelizing" too. We could become the #1 word
    processor for translators, linguists, phrasebooks,
    bilingual dictionaries, language revival, language
    courses, dictionary compilation. We could be used
    by SBS television in Australia, we could be used by
    the UN!

    Okay I'll settle down now (:

    > Does this sound interesting? Is anyone interested
    > in coordinating such an
    > effort? It seems like a large task to write up as a
    > uPOW.

    (usual disclaimer about needing a machine)

    Andrew Dunbar.

    > Paul

    =====
    http://linguaphile.sourceforge.net http://www.abisource.com




    This archive was generated by hypermail 2.1.4 : Tue Apr 30 2002 - 00:49:42 EDT