Japanese characters on OO.o presentation--> i18n
Nicholas Bodley
nbodley at speakeasy.net
Thu Mar 16 22:28:27 EST 2006
{Long message; sorry, no online repository I can give a link to.}
(What's "i18n"? Answer: A quick way to type "internationalization", if you
don't have auto-completion (like tab-completion) for routine text entry;
most people don't. That's a 20-letter word.)
Basically, entering Japanese is one specific aspect of a much-more-general
topic, which is based on the fact that people who don't use our
really-simple* 26-letter alphabet truly *do* often want to read and type
in their own language and writing system, when using their computers.
Can't blame them.
*Except for caps and small letters, that is; many alphabets and
alphabet-like writing systems don't have caps.
I need to do a lot of study and "digging" for i18n in Linux; I want an OS
(and e-mail composer) that can insert any Unicode character, and render
most major writing systems. Right now, Libranet (a nice Debian derivative)
is very disappointing (ASCII only!) in its default configuration for
e-mail composition in Opera (it's not Opera's fault, I'm just about sure).
I'll be migrating to a newer machine RSN, and will really get serious
about i18n when I do. So far (I haven't really tried), Libranet renders
most writing systems just fine. (Yes, I do know about Yudit, although I
tend to forget about it; shame.)
*L18nux, anyone? Only ~90 Google hits!
Opera's e-mail composer in Win is not even a text editor (it's pathetic),
but has one endearing feature: enter the "U+nnnn" hex code for a Unicode
character, do an Alt-X, and, if it's in your fonts, it will be rendered
in-line. Combine that with MS Win's Alt-nnnn scheme for entering 8-bit
characters in the current encoding, and I'm almost as happy as a clam in
an unpolluted wetland.
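(The U+nnnn trick is easy to sketch. Here's a rough Python approximation
of the idea -- not Opera's actual code, and the alt_x name is mine --
that parses a trailing hex code and substitutes the character, if your
fonts can show it:)

    # A rough sketch of the U+nnnn-then-Alt-X idea (a toy, not
    # Opera's code): parse the trailing hex code, substitute the char.
    def alt_x(text):
        head, _, hexcode = text.rpartition("U+")
        return head + chr(int(hexcode, 16))

    print(alt_x("happy as a clam: U+263A"))   # happy as a clam: ☺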
Essentially, what follows is a bunch of thoughts, trying to stick to
what's important, about i18n; it could help introduce the topic. At the
end, I have a reply to Robert. Corrections are welcome! I'm an amateur and
dilettante about this topic.
===
Being the son of a father born in (Tsarist!) Russia and a first-gen.
American (English) mother, perhaps I'm a bit more of an internationalist
(I didn't count the letters :) ) than many with native-born, say,
great-grandparents or more.
Around a decade ago, when I was trying Xdenu Linux (command-line only!)
(and Debian, also command-line, a few years later) but still using a Moss
Doss 6.22 machine, I was active on the Early Music* mailing list, mirrored
to rec.music.early (iirc!), and was horrified to see badly-munged European
names; it was Moss Doss and Codepage 437. After researching matters, I
learned about the Codepage 850 series, installed Codepage 850, and was
further horrified to see different munging.
*As in Handel, say, not early Beatles (which I love, too)
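(That munging is easy to reproduce today. A small Python illustration,
assuming Latin-1-encoded text viewed through the wrong codepage; "Händel"
is my example name:)

    # The same Latin-1 bytes, read through three different codepages:
    name = "H\u00e4ndel".encode("latin-1")   # "Händel" in ISO-8859-1
    print(name.decode("cp437"))   # 'HΣndel' -- the Codepage 437 munging
    print(name.decode("cp850"))   # 'Hõndel' -- different munging on 850
    print(name.decode("cp819"))   # 'Händel' -- Codepage 819 *is* Latin-1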
A fellow known as Kosta Kostis, of Greek heritage, a German resident (and
probably a German citizen), hated seeing letters with umlauts (pairs of
dots over a, o, and u) and the German ß badly munged in DOS. He found out
about Codepage 819, which is identical in its encoding and (implied?
Actual?) character set to Latin-1, a.k.a. ISO-8859-1. Because DOS
commands have hidden sanity checks that reject all but a small subset of
the probably several hundred codepages, he wrote his own replacement
commands, and packaged them along with a set of codepages corresponding
to ISO-8859-[n] for [n] up to 10, iirc.
I got his .zip file, installed it, fooled around a bit, but finally
installed all 10 or so. I still have a nice HP Vectra 386-16/N DOS machine
that's '8859-compatible; it can display Turkish, Icelandic, or Polish, but
simply not all at once. The nice box-drawing characters do get lost, a
tradeoff I was willing to live with.
Alan Flavell <http://ppewww.ph.gla.ac.uk/~flavell/> (home page) was the
first person whose online text clued me in to i18n and Codepage 819.
Others I remember were Jukka Korpela (really good), Markus Kuhn, and Roman
Czyborra. I became a Unicode hobbyist (no kidding). Studying Unicode, and
studying about it, reawakened an interest in writing systems.
While Unicode is not a divine gift (it's not universally loved, apparently
especially in Asia), it's the most-widely-accepted way, and quite a good
one, of making a computer work with other writing systems. (Btw,
"writing", in this context, refers to typing, typesetting, printing, font
design, etc.; it's far more than just handwriting, which is a small
subset. The Yahoo Group Qalam (reed/pen in Arabic) is a mailing list for
those really interested in the topic.)
I politely harangued Opera software a few years ago, and might possibly
have had a little influence in making their browser much more
internationally-compatible, and sooner.
I18n is a huge topic, and here in the USA, for reasons of cultural
history, geographical isolation*, and sheer size, we are mostly a
monolingual nation with by far the simplest major writing system there is.
Malaysian/Indonesian and a very few other less-widely-spoken languages are
the only others besides English that don't add diacritical marks routinely
to some of their letters.
*But, consider Canadian French near the Quebec border, and Spanish as our
national second language, almost...
When rendering (to screen or paper), we can simply take each byte of text
and place it to the right of the previous one, accounting for line breaks
either in flowed format (word-processor, and up-to date e-mail) or
explicitly. Life, for us, is simple.
The various ISO character sets worked well for limited ranges of
languages, but for a truly global setup, Unicode is the way to go (unless
you're in parts of Asia, afaik. :) )
The text at the beginning of the Unicode manual* is a capsule introduction
to i18n as it affects computer text, including text preparation, storage,
rendering, and sorting (collating?) sequences. It's not a general
introduction to writing systems, although it can be a shock at times to
read about their essentials.
*Online in PDF form (very nice, too), but not as one huge file, afaik!
<www.unicode.org> should get you started.
In general, 16 bits per character works well, and the Unicode people do
value common sense a lot. The first 128 code points are identical to
ASCII, so plain ASCII text carries over unchanged. The fun begins at the
8th bit, and ends only with extensions beyond 16 bits, which the Unicode
people were sensible enough to provide for.
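(A quick Python demonstration of those three regimes -- ASCII, the rest
of the 16-bit range, and beyond -- using UTF-16, where beyond-16-bit
characters take two 16-bit units, a "surrogate pair":)

    # ASCII, a 16-bit (BMP) character, and one beyond 16 bits:
    for ch in ("A", "\u30ab", "\U0001d11e"):   # A, katakana KA, G clef
        print(f"U+{ord(ch):04X}", ch.encode("utf-16-be").hex(" "))
    # U+0041 00 41
    # U+30AB 30 ab
    # U+1D11E d8 34 dd 1e   <- a surrogate pair: two 16-bit units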
One surprise is that quite a few important writing systems (in India, in
particular) don't, strictly speaking, have true alphabets, although a
casual look at their Unicode charts would make you think so. (They are
technically abugidas: each consonant letter carries an inherent vowel.
The abjads, such as Arabic and Hebrew, are the other non-alphabet
category; those usually write no vowels at all.) Each standalone
character, just about always, has an inherent vowel, apparently just
about always like our "ah". It's as if we had no vowels, and all our
regular letters were called "ba, ca, da, fa, ga, ha, ja, ka, etc.". (The
name "Jalal Talabani" would be quite concise in such a system!) There are
ways to write standalone consonants, using what are called (slang) "vowel
killers" (formally, the virama), and also ways to write standalone
vowels, of course.
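(Here's the inherent vowel and its "killer" in action, in Python, using
Devanagari; U+094D is the virama:)

    # Devanagari KA carries an inherent "a"; the virama kills it:
    ka, virama, ta = "\u0915", "\u094d", "\u0924"
    print(ka)                 # "ka"
    print(ka + virama)        # bare "k"
    print(ka + virama + ta)   # conjunct "kta", if the font can shape it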
Then, one also learns that one cannot simply take the next byte, look up
its bitmap, and render it. Many writing systems require rather-complicated
schemes for analyzing byte strings locally and creating acceptable bitmap
images.
Arabic and Hebrew, of course, are written right-to-left (RtoL), and some
real fun begins when you mix them with LtoR scripts (script: One word for
"writing system"). Consider Arabic text, flowed format, with a fairly-long
English quote in-line. Line breaks when rendering? Major change in column
width? Editing the quote, maybe extending it? Unicode handles this (and
probably all other matters concerning mixed-direction text) with the BiDi
algorithm. (That's "bidirectional", of course, not the small-cigar-like
bidis of India).
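(Unicode tags every character with a directionality class, which is what
the BiDi algorithm works from. A peek at those classes via Python's
unicodedata module:)

    import unicodedata
    # Latin A, Hebrew alef, Arabic alef, and a digit:
    for ch in ("A", "\u05d0", "\u0627", "1"):
        print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch))
    # U+0041 L    (left-to-right)
    # U+05D0 R    (right-to-left)
    # U+0627 AL   (Arabic letter, right-to-left)
    # U+0031 EN   (European number)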
Arabic presents special challenges. Scientific American had an excellent
article about its use in computers roughly 15 years (?) ago. Arabic *has*
to be rendered essentially as connected script, much like our handwritten
text where letters are joined. Arabic rendered with standalone letters
looks quite bad (much worse, afaik, than all caps in e-mail), and is
probably a bear to read. (I don't know Arabic, only about it, in general.)
Many letters in Arabic have no fewer than four forms, a good number of
which are not recognizable as the same letter to unschooled eyes. (Those
four forms are initial, medial, final, and standalone. Initial starts a
word, medial is in the middle, and final ends a word, just to clarify.)
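(Unicode even encodes the four forms separately, as compatibility
"presentation forms" -- normally the shaping engine picks them, not the
typist. The letter BEH as an example, in Python:)

    # Arabic BEH and its four presentation forms (U+FE8F..U+FE92):
    forms = {"isolated": "\ufe8f", "final": "\ufe90",
             "initial": "\ufe91", "medial": "\ufe92"}
    for name, ch in forms.items():
        print(f"{name:9} U+{ord(ch):04X} {ch}")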
Arabic letters also need to be joined, and the joining seems to be
non-trivial. The whole process of rendering byte strings of text includes
what are called "shaping and joining", and it's language-specific. At
least in Win 9x, likely 2K, and probably XP, a DLL named Uniscribe (iirc!
-- several versions exist; I'm using usp10.dll in 98 SE) takes care of
these details for many major writing systems.
Concerning the dozen or so major writing systems of India, most
otherwise-universal software that's been internationalized still can't
render them properly. Microsoft, however, has some business arrangement
(perhaps ownership) with a company in India that is doing something
serious about the situation. I'm surely no MS fan, but they have done a lot of
good, imho, for i18n and international computer typography. Their Arial
Unicode font (no longer available) was (imho) a tremendous boon to the
computer community, for one. The Typography section of their Web pages has
been laudably un-commercial. (While on the topic, my Libranet seems to
render many writing systems quite nicely. I'm very pleased about that. I
did import fonts from 98 SE.)
What it amounts to is that rendering most of the world's writing systems
acceptably requires writing-system-specific software.
Furthermore, entering text in systems like Amharic (Ethiopian), Korean, or
Chinese and Japanese, all of which have at least a few hundred rendered
characters, using the keyboard we are accustomed to, requires what seems
to be called "input method"* software that acts as an intermediary between
the keyboard and the end application.
*MS term: Input Method Editor, IME
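(Korean is the friendly case: hangul syllables compose from jamo purely
algorithmically, so that part of an input method is a one-liner; Chinese
and Japanese need dictionaries and candidate menus. A Python sketch, with
my own hangul() helper name:)

    # A hangul syllable composes from lead/vowel/tail jamo indices:
    def hangul(lead, vowel, tail=0):
        return chr(0xAC00 + (lead * 21 + vowel) * 28 + tail)

    print(hangul(18, 0, 4))   # "han" -- lead HIEUH, vowel A, tail NIEUN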
Just today, I had a look at Mandriva 2K6 Live, and was happily surprised
to see the variety of keyboard layouts. (Hey! Dvoraks in Scandinavia!
There's hope, yet!) I can't say that without noting the install language
choices. I thought I knew the names of most significant languages; not so.
(Some obscure ones are probably not "significant", but to their speakers,
they are!)
A final thought: Thai does not use word spaces.
Itstextisruntogetherlikethis. Afaik, such a simple matter as breaking
lines of text properly* requires dictionary lookup!
*I've seen hyphenated line breaks in recently-composed English-language
texts that make me want to lose my last meal...
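(A toy longest-match segmenter gives the flavor of what Thai line
breaking needs; real systems use big dictionaries and smarter rules.
Using my run-together English as the test case:)

    # Greedy longest-match segmentation against a toy dictionary:
    DICT = {"its", "text", "is", "run", "together", "like", "this"}

    def segment(s, maxlen=8):
        words, i = [], 0
        while i < len(s):
            for j in range(min(len(s), i + maxlen), i, -1):
                if s[i:j] in DICT:
                    words.append(s[i:j]); i = j; break
            else:
                words.append(s[i]); i += 1   # unknown: emit one character
        return words

    print(segment("itstextisruntogetherlikethis"))
    # ['its', 'text', 'is', 'run', 'together', 'like', 'this']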
AT LAST:
¯¯¯¯¯¯¯¯¯¯¯¯
On Thu, 16 Mar 2006 14:17:16 -0500, Robert La Ferla
<robertlaferla at comcast.net> wrote:
> BTW - When sending e-mail in Japanese, use ISO-2022-JP for your encoding
> to avoid complaints about mojibake.
Imho, read and heed! I didn't know that. I'm extremely unlikely to send
e-mail in Japanese, but it's one of those essentials (like knowledge of
BCC) one really has to keep in mind when sending e-mail.
As I understand it (and I might well be wrong! Corrections welcome!),
there are at least two basically-different ways to encode Japanese text;
ISO-2022-JP itself is the modal one, something like the old {ltrs}/{figs}
shift in 5-bit teleprinters -- one can be in the wrong mode. (Shift-JIS,
despite its name, is stateless.) The consequence is that if a
"mode-change" escape sequence is omitted, or wrongly sent when it should
not be (or munged...), all subsequent text (at least up to a redefining
of "mode") is scrambled badly. If you think seeing English text in {figs}
shift is bad, when you have a practical set of something like 2,300 or so
basically-Chinese characters, and are receiving nonsense, as I understand
it, that's mojibake.
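(Mojibake on demand, in Python: encode a word in ISO-2022-JP, then read
the JIS-mode bytes as if the mode shift had been lost:)

    # ISO-2022-JP switches modes with escape sequences:
    encoded = "\u65e5\u672c\u8a9e".encode("iso-2022-jp")   # "Japanese"
    print(encoded)   # b'\x1b$BF|K\\8l\x1b(B' -- ESC $ B ... ESC ( B
    # Strip the escapes and the same bytes read as ASCII nonsense:
    print(encoded[3:-3].decode("ascii"))   # 'F|K\8l' -- mojibake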
[Katakana]
One can read more Japanese than one might, at first, expect. Japan has
imported English words "wholesale", sometimes adapting them to their own
language (I'm typing on a Compaq "pasokon" -- pasonaru konpyuutaa).
Perhaps 35,000 words have been imported. These words are rendered/written
with a simple syllabary called katakana, which (except for
arbitrary-seeming, never-complicated character shapes) is about as easy to
learn* as an alphabet, and can be a *lot* of fun. By no means whatsoever is
katakana anywhere near as difficult to learn as kanji (Japanese name for
Chinese characters; kan = China, ji = characters). As to that "arbitrary",
katakana started life as sometimes-complicated Chinese characters, used
purely for phonetic purposes. I could go on, and on...
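(For instance, "pasokon" in katakana is four code points, one per
syllable/mora -- easy to look up in Python:)

    import unicodedata
    # "pasokon" spelled in katakana, one code point per mora:
    for ch in "\u30d1\u30bd\u30b3\u30f3":
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))
    # U+30D1 KATAKANA LETTER PA
    # U+30BD KATAKANA LETTER SO
    # U+30B3 KATAKANA LETTER KO
    # U+30F3 KATAKANA LETTER N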
Knowing katakana can be useful; if you learn it, or at least have a
character chart*, you can read part of any Japanese text. (Btw, in all,
Japanese uses no fewer than four character sets in routine text, if you
count romaji (roman/latin letters) as part of their writing system, which
they are. Four is unique; no other writing system routinely uses that
many.)
*The character chart alone isn't quite sufficient, but it helps
considerably. Try Tuttle, the publisher, for a small book about katakana
for people going to Japan.
(Just in case anyone checks, this is being sent via the Delicate Flower
(skunk-cabbage flower?) OS, the one that never seems to run out of
unimaginable ways to crash; however, Opera e-mail in that OS is *vastly*
better for i18n than straight Libranet*! I'm currently using both; these
days, at the beginning of a session, I decide which to boot first.)
*A very nice Debian derivative, but likely to fade from the scene.
My regards to all,
--
Nicholas Bodley [{(<>)}] Waltham, Mass.
Midnight hacker (approved by management)
in 1960, on an all-NAND-gate machine with
a 19-bit word length; paper tape code was
duotricenary (radix-32).