Japanese characters on OO.o presentation--> i18n
Nicholas Bodley
nbodley at speakeasy.net
Thu Mar 16 22:28:27 EST 2006
{Long message; sorry, no online repository I can give a link to.}
(What's "i18n"? Answer: A quick way to type "internationalization", if you
don't have auto-completion (like tab-completion) for routine text entry;
most people don't. That's a 20-letter word.)
Basically, entering Japanese is one specific aspect of a much-more-general
topic, which is based on the fact that people who don't use our
really-simple* 26-letter alphabet truly *do* often want to read and type
in their own language and writing system, when using their computers.
Can't blame them.
*Except for caps and small letters, that is; many alphabets and
alphabet-like writing systems don't have caps.
I need to do a lot of study and "digging" for i18n in Linux; I want an OS
(and e-mail composer) that can insert any Unicode character, and render
most major writing systems. Right now, Libranet (a nice Debian derivative)
is very disappointing (ASCII only!) in its default configuration for
e-mail composition in Opera (it's not Opera's fault, I'm just about sure).
I'll be migrating to a newer machine RSN, and will really get serious
about i18n when I do. So far (I haven't really tried), Libranet renders
most writing systems just fine. (Yes, I do know about Yudit, although I
tend to forget about it; shame.)
*L18nux, anyone? Only ~90 Google hits!
Opera's e-mail composer in Win is not even a text editor (it's pathetic),
but has one endearing feature: enter the "U+nnnn" hex code for a Unicode
character, do an Alt-X, and, if it's in your fonts, it will be rendered
in-line. Combine that with MS Win's Alt-nnnn scheme for entering 8-bit
characters in the current encoding, and I'm almost as happy as a clam in
an unpolluted wetland.
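(The U+nnnn trick is easy to sketch. Here's a rough Python approximation
of the idea -- not Opera's actual code, and the alt_x name is mine --
that parses a trailing hex code and substitutes the character, if your
fonts can show it:)

    # A rough sketch of the U+nnnn-then-Alt-X idea (a toy, not
    # Opera's code): parse the trailing hex code, substitute the char.
    def alt_x(text):
        head, _, hexcode = text.rpartition("U+")
        return head + chr(int(hexcode, 16))

    print(alt_x("happy as a clam: U+263A"))   # happy as a clam: ☺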
Essentially, what follows is a bunch of thoughts, trying to stick to
what's important, about i18n; it could help introduce the topic. At the
end, I have a reply to Robert. Corrections are welcome! I'm an amateur and
dilettante about this topic.
===
Being the son of a father born in (Tsarist!) Russia and a first-gen.
American (English) mother, perhaps I'm a bit more of an internationalist
(I didn't count the letters :) ) than many with native-born, say,
great-grandparents or more.
Around a decade ago, when I was trying Xdenu Linux (command-line only!)
(and Debian, also command-line, a few years later) but still using a Moss
Doss 6.22 machine, I was active on the Early Music* mailing list, mirrored
to rec.music.early (iirc!), and was horrified to see badly-munged European
names; it was Moss Doss and Codepage 437. After researching matters, I
learned about the Codepage 850 series, installed Codepage 850, and was
further horrified to see different munging.
*As in Handel, say, not early Beatles (which I love, too)
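(That munging is easy to reproduce today. A small Python illustration,
assuming Latin-1-encoded text viewed through the wrong codepage; "Händel"
is my example name:)

    # The same Latin-1 bytes, read through three different codepages:
    name = "H\u00e4ndel".encode("latin-1")   # "Händel" in ISO-8859-1
    print(name.decode("cp437"))   # 'HΣndel' -- the Codepage 437 munging
    print(name.decode("cp850"))   # 'Hõndel' -- different munging on 850
    print(name.decode("cp819"))   # 'Händel' -- Codepage 819 *is* Latin-1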
A fellow known as Kosta Kostis, of Greek heritage, a German resident (and
probably a German citizen), hated seeing letters with umlauts (pairs of
dots over a, o, and u) and the German ß badly munged in DOS. He found out
about Codepage 819, which is identical in its encoding and (implied?
Actual?) character set to Latin-1, a.k.a. ISO-8859-1. Because DOS
commands have hidden sanity checks that reject all but a small subset of
the probably several hundred codepages, he wrote his own replacement
commands, and packaged them along with a set of codepages corresponding
to ISO-8859-[n] for [n] up to 10, iirc.
I got his .zip file, installed it, fooled around a bit, but finally
installed all 10 or so. I still have a nice HP Vectra 386-16/N DOS machine
that's '8859-compatible; it can display Turkish, Icelandic, or Polish, but
simply not all at once. The nice box-drawing characters do get lost, a
tradeoff I was willing to live with.
Alan Flavell <http://ppewww.ph.gla.ac.uk/~flavell/> (home page) was the
first person whose online text clued me in to i18n and Codepage 819.
Others I remember were Jukka Korpela (really good), Markus Kuhn, and Roman
Czyborra. I became a Unicode hobbyist (no kidding). Studying Unicode, and
studying about it, reawakened an interest in writing systems.
While Unicode is not a divine gift (it's not universally loved, apparently
especially in Asia), it's the most-widely-accepted way, and quite a good
one, of making a computer work with other writing systems. (Btw,
"writing", in this context, refers to typing, typesetting, printing, font
design, etc.; it's far more than just handwriting, which is a small
subset. The Yahoo Group Qalam (reed/pen in Arabic) is a mailing list for
those really interested in the topic.)
I politely harangued Opera software a few years ago, and might possibly
have had a little influence in making their browser much more
internationally-compatible, and sooner.
I18n is a huge topic, and here in the USA, for reasons of cultural
history, geographical isolation*, and sheer size, we are mostly a
monolingual nation with by far the simplest major writing system there is.
Malaysian/Indonesian and a very few other less-widely-spoken languages are
the only others besides English that don't add diacritical marks routinely
to some of their letters.
*But, consider Canadian French near the Quebec border, and Spanish as our
national second language, almost...
When rendering (to screen or paper), we can simply take each byte of text
and place it to the right of the previous one, accounting for line breaks
either in flowed format (word-processor, and up-to date e-mail) or
explicitly. Life, for us, is simple.
The various ISO character sets worked well for limited ranges of
languages, but for a truly global setup, Unicode is the way to go (unless
you're in parts of Asia, afaik. :) )
The text at the beginning of the Unicode manual* is a capsule introduction
to i18n as it affects computer text, including text preparation, storage,
rendering, and sorting (collating?) sequences. It's not a general
introduction to writing systems, although it can be a shock at times to
read about their essentials.
*Online in PDF form (very nice, too), but not as one huge file, afaik!
<www.unicode.org> should get you started.
In general, 16 bits per character works well, and the Unicode people do
value common sense a lot. The first 128 code points are identical to
ASCII, so plain ASCII text carries over unchanged. The fun begins at the
8th bit, and ends only with extensions beyond 16 bits, which the Unicode
people were sensible enough to provide for.
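(A quick Python demonstration of those three regimes -- ASCII, the rest
of the 16-bit range, and beyond -- using UTF-16, where beyond-16-bit
characters take two 16-bit units, a "surrogate pair":)

    # ASCII, a 16-bit (BMP) character, and one beyond 16 bits:
    for ch in ("A", "\u30ab", "\U0001d11e"):   # A, katakana KA, G clef
        print(f"U+{ord(ch):04X}", ch.encode("utf-16-be").hex(" "))
    # U+0041 00 41
    # U+30AB 30 ab
    # U+1D11E d8 34 dd 1e   <- a surrogate pair: two 16-bit units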
One surprise is that quite a few important writing systems (in India, in
particular) don't, strictly speaking, have true alphabets, although a
casual look at their Unicode charts would make you think so. (They are
technically abugidas: each consonant letter carries an inherent vowel.
The abjads, such as Arabic and Hebrew, are the other non-alphabet
category; those usually write no vowels at all.) Each standalone
character, just about always, has an inherent vowel, apparently just
about always like our "ah". It's as if we had no vowels, and all our
regular letters were called "ba, ca, da, fa, ga, ha, ja, ka, etc.". (The
name "Jalal Talabani" would be quite concise in such a system!) There are
ways to write standalone consonants, using what are called (slang) "vowel
killers" (formally, the virama), and also ways to write standalone
vowels, of course.
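(Here's the inherent vowel and its "killer" in action, in Python, using
Devanagari; U+094D is the virama:)

    # Devanagari KA carries an inherent "a"; the virama kills it:
    ka, virama, ta = "\u0915", "\u094d", "\u0924"
    print(ka)                 # "ka"
    print(ka + virama)        # bare "k"
    print(ka + virama + ta)   # conjunct "kta", if the font can shape it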
Then, one also learns that one cannot simply take the next byte, look up
its bitmap, and render it. Many writing systems require rather-complicated
schemes for analyzing byte strings locally and creating acceptable bitmap
images.
Arabic and Hebrew, of course, are written right-to-left (RtoL), and some
real fun begins when you mix them with LtoR scripts (script: One word for
"writing system"). Consider Arabic text, flowed format, with a fairly-long
English quote in-line. Line breaks when rendering? Major change in column
width? Editing the quote, maybe extending it? Unicode handles this (and
probably all other matters concerning mixed-direction text) with the BiDi
algorithm. (That's "bidirectional", of course, not the small-cigar-like
bidis of India).
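(Unicode tags every character with a directionality class, which is what
the BiDi algorithm works from. A peek at those classes via Python's
unicodedata module:)

    import unicodedata
    # Latin A, Hebrew alef, Arabic alef, and a digit:
    for ch in ("A", "\u05d0", "\u0627", "1"):
        print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch))
    # U+0041 L    (left-to-right)
    # U+05D0 R    (right-to-left)
    # U+0627 AL   (Arabic letter, right-to-left)
    # U+0031 EN   (European number)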
Arabic presents special challenges. Scientific American had an excellent
article about its use in computers roughly 15 years (?) ago. Arabic *has*
to be rendered essentially as connected script, much like our handwritten
text where letters are joined. Arabic rendered with standalone letters
looks quite bad (much worse, afaik, than all caps in e-mail), and is
probably a bear to read. (I don't know Arabic, only about it, in general.)
Many letters in Arabic have no fewer than four forms, a good number of
which are not recognizable as the same letter to unschooled eyes. (Those
four forms are initial, medial, final, and standalone. Initial starts a
word, medial is in the middle, and final ends a word, just to clarify.)
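(Unicode even encodes the four forms separately, as compatibility
"presentation forms" -- normally the shaping engine picks them, not the
typist. The letter BEH as an example, in Python:)

    # Arabic BEH and its four presentation forms (U+FE8F..U+FE92):
    forms = {"isolated": "\ufe8f", "final": "\ufe90",
             "initial": "\ufe91", "medial": "\ufe92"}
    for name, ch in forms.items():
        print(f"{name:9} U+{ord(ch):04X} {ch}")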
Arabic letters also need to be joined, and the joining seems to be
non-trivial. The whole process of rendering byte strings of text includes
what are called "shaping and joining", and it's language-specific. At
least in Win 9x, likely 2K, and probably XP, a DLL named Uniscribe (iirc!
-- several versions exist; I'm using usp10.dll in 98 SE) takes care of
these details for many major writing systems.
Concerning the dozen or so major writing systems of India, most
otherwise-universal software that's been internationalized still can't
render them properly. Microsoft, however, has some business arrangement
(perhaps ownership) with a company in India that is doing something
serious about the situation. I'm surely no MS fan, but they have done a lot of
good, imho, for i18n and international computer typography. Their Arial
Unicode font (no longer available) was (imho) a tremendous boon to the
computer community, for one. The Typography section of their Web pages has
been laudably un-commercial. (While on the topic, my Libranet seems to
render many writing systems quite nicely. I'm very pleased about that. I
did import fonts from 98 SE.)
What it amounts to is that rendering most of the world's writing systems
acceptably requires writing-system-specific software.
Furthermore, entering text in systems like Amharic (Ethiopian), Korean, or
Chinese and Japanese, all of which have at least a few hundred rendered
characters, using the keyboard we are accustomed to, requires what seems
to be called "input method"* software that acts as an intermediary between
the keyboard and the end application.
*MS term: Input Method Editor, IME
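(Korean is the friendly case: hangul syllables compose from jamo purely
algorithmically, so that part of an input method is a one-liner; Chinese
and Japanese need dictionaries and candidate menus. A Python sketch, with
my own hangul() helper name:)

    # A hangul syllable composes from lead/vowel/tail jamo indices:
    def hangul(lead, vowel, tail=0):
        return chr(0xAC00 + (lead * 21 + vowel) * 28 + tail)

    print(hangul(18, 0, 4))   # "han" -- lead HIEUH, vowel A, tail NIEUN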
Just today, I had a look at Mandriva 2K6 Live, and was happily surprised
to see the variety of keyboard layouts. (Hey! Dvoraks in Scandinavia!
There's hope, yet!) I can't say that without noting the install language
choices. I thought I knew the names of most significant languages; not so.
(Some obscure ones are probably not "significant", but to their speakers,
they are!)
A final thought: Thai does not use word spaces.
Itstextisruntogetherlikethis. Afaik, such a simple matter as breaking
lines of text properly* requires dictionary lookup!
*I've seen hyphenated line breaks in recently-composed English-language
texts that make me want to lose my last meal...
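(A toy longest-match segmenter gives the flavor of what Thai line
breaking needs; real systems use big dictionaries and smarter rules.
Using my run-together English as the test case:)

    # Greedy longest-match segmentation against a toy dictionary:
    DICT = {"its", "text", "is", "run", "together", "like", "this"}

    def segment(s, maxlen=8):
        words, i = [], 0
        while i < len(s):
            for j in range(min(len(s), i + maxlen), i, -1):
                if s[i:j] in DICT:
                    words.append(s[i:j]); i = j; break
            else:
                words.append(s[i]); i += 1   # unknown: emit one character
        return words

    print(segment("itstextisruntogetherlikethis"))
    # ['its', 'text', 'is', 'run', 'together', 'like', 'this']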
AT LAST:
¯¯¯¯¯¯¯¯¯¯¯¯
On Thu, 16 Mar 2006 14:17:16 -0500, Robert La Ferla
<robertlaferla at comcast.net> wrote:
> BTW - When sending e-mail in Japanese, use ISO-2022-JP for your encoding
> to avoid complaints about mojibake.
Imho, read and heed! I didn't know that. I'm extremely unlikely to send
e-mail in Japanese, but it's one of those essentials (like knowledge of
BCC) one really has to keep in mind when sending e-mail.
As I understand it (and I might well be wrong! Corrections welcome!),
there are at least two basically-different ways to encode Japanese text;
ISO-2022-JP itself is the modal one, something like the old {ltrs}/{figs}
shift in 5-bit teleprinters -- one can be in the wrong mode. (Shift-JIS,
despite its name, is stateless.) The consequence is that if a
"mode-change" escape sequence is omitted, or wrongly sent when it should
not be (or munged...), all subsequent text (at least up to a redefining
of "mode") is scrambled badly. If you think seeing English text in {figs}
shift is bad, when you have a practical set of something like 2,300 or so
basically-Chinese characters, and are receiving nonsense, as I understand
it, that's mojibake.
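(Mojibake on demand, in Python: encode a word in ISO-2022-JP, then read
the JIS-mode bytes as if the mode shift had been lost:)

    # ISO-2022-JP switches modes with escape sequences:
    encoded = "\u65e5\u672c\u8a9e".encode("iso-2022-jp")   # "Japanese"
    print(encoded)   # b'\x1b$BF|K\\8l\x1b(B' -- ESC $ B ... ESC ( B
    # Strip the escapes and the same bytes read as ASCII nonsense:
    print(encoded[3:-3].decode("ascii"))   # 'F|K\8l' -- mojibake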
[Katakana]
One can read more Japanese than one might, at first, expect. Japan has
imported English words "wholesale", sometimes adapting them to their own
language (I'm typing on a Compaq "pasokon" -- pasonaru konpyuutaa).
Perhaps 35,000 words have been imported. These words are rendered/written
with a simple syllabary called katakana, which (except for
arbitrary-seeming, never-complicated character shapes) is about as easy to
learn* as an alphabet, and can be a *lot* of fun. By no means whatsoever is
katakana anywhere near as difficult to learn as kanji (Japanese name for
Chinese characters; kan = China, ji = characters). As to that "arbitrary",
katakana started life as sometimes-complicated Chinese characters, used
purely for phonetic purposes. I could go on, and on...
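(For instance, "pasokon" in katakana is four code points, one per
syllable/mora -- easy to look up in Python:)

    import unicodedata
    # "pasokon" spelled in katakana, one code point per mora:
    for ch in "\u30d1\u30bd\u30b3\u30f3":
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))
    # U+30D1 KATAKANA LETTER PA
    # U+30BD KATAKANA LETTER SO
    # U+30B3 KATAKANA LETTER KO
    # U+30F3 KATAKANA LETTER N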
Knowing katakana can be useful; if you learn it, or at least have a
character chart*, you can read part of any Japanese text. (Btw, in all,
Japanese uses no fewer than four character sets in routine text, if you
count romaji (roman/latin letters) as part of their writing system, which
they are. Four is unique; no other writing system routinely uses that
many.)
*The character chart alone isn't quite sufficient, but it helps
considerably. Try Tuttle, the publisher, for a small book about katakana
for people going to Japan.
(Just in case anyone checks, this is being sent via the Delicate Flower
(skunk-cabbage flower?) OS, the one that never seems to run out of
unimaginable ways to crash; however, Opera e-mail in that OS is *vastly*
better for i18n than straight Libranet*! I'm currently using both; these
days, at the beginning of a session, I decide which to boot first.)
*A very nice Debian derivative, but likely to fade from the scene.
My regards to all,
--
Nicholas Bodley [{(<>)}] Waltham, Mass.
Midnight hacker (approved by management)
in 1960, on an all-NAND-gate machine with
a 19-bit word length; paper tape code was
duotricenary (radix-32).