i18n
John Chambers
jc at trillian.mit.edu
Sun Mar 19 11:25:38 EST 2006
#To: <discuss-bounces at blu.org> (Robert La Ferla)
|
| BTW - One of the most powerful features of Mac OS X (and its ancestor
| NextStep) is the NSText class (and related classes) in the
| ApplicationKit API. The NSText class is what makes OS X applications
| easy to localize. It supports East Asian, Arabic/Hebrew (right to left
| scripts), etc... natively. It was a monster to program but the results
| are fantastic. Nearly all OS X apps use it for everything from simple
| text fields (shared text object) or whole documents. As a result, you
| can enter text in any language nearly anywhere in an app and copy/paste
| it to other apps, print it, etc...
Hmmm ... You must have a very different instantiation of OS X than
what's on my wife's and my Powerbooks. Shelley and I have done a fair
bit of experimenting with text in languages with non-Roman charsets,
and our experience is very different. I can hardly find any apps that
correctly implement copy/paste for anything but English text. Most of
them produce gibberish for non-Latin1 charsets at least part of the
time. They can hardly even handle simple Cyrillic text sanely.
As a test case, I recently set up a stress-test example in my online
music collection:
http://trillian.mit.edu/~jc/music/abc/China/GIS_4T_D_W.utf8.abc
http://trillian.mit.edu/~jc/music/abc/China/GIS_4T_D_W.utf8.txt
http://trillian.mit.edu/~jc/music/abc/China/GIS_4T_D_W.utf8.pdf
http://trillian.mit.edu/~jc/music/abc/China/GIS_4T_D_W.utf8.ps
The .abc and .txt files are the same file, with different MIME types.
If you look at the .txt file in firefox, you'll see several versions
of the title, in the original Chinese, in a Pinyin transliteration
(note the U-umlaut in the first word), in English, and in Arabic.
With firefox, all are displayed correctly. I can also cat this file
in a Mac Terminal window, and the titles render correctly. I can
ssh to this FreeBSD system or to my linux box and cat the file, and
they come out correct. I can also ssh from an xterm on my linux box
to the other machines, cat the file, and it's fine. This seems to
prove that Unicode fonts are installed, the terminal emulators
understand UTF-8, and none of the software in the ssh+tcsh+cat chain
breaks anything.
With Safari, which is an Apple browser, the Arabic is rendered badly.
The right-left order is ok, but the letters are all initial forms,
and they aren't connected. This isn't acceptable (in civilized Arabic
society ;-). Getting the letter forms wrong makes the text very
difficult to read.
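For anyone unfamiliar with what "initial forms" means: each Arabic letter has up to four contextual shapes, and a renderer is supposed to choose one per letter based on its neighbors. Here's a rough Python sketch that lists the four shapes of one letter, using the Unicode "presentation form" code points. The choice of the letter BEH is arbitrary, and this only illustrates the concept; it says nothing about how Safari picks glyphs internally.

```python
import unicodedata

# ARABIC LETTER BEH (chosen arbitrarily) and its four presentation forms.
BEH = "\u0628"
FORMS = {
    "isolated": "\uFE8F",
    "final": "\uFE90",
    "initial": "\uFE91",
    "medial": "\uFE92",
}

def describe_forms():
    """Return (position, character name, base letter) for each form of BEH."""
    rows = []
    for position, ch in FORMS.items():
        # The decomposition looks like "<initial> 0628": a tag, then the
        # code point of the base letter in hex.
        tag, base_hex = unicodedata.decomposition(ch).split()
        rows.append((position, unicodedata.name(ch), chr(int(base_hex, 16))))
    return rows

for position, name, base in describe_forms():
    print(f"{position:9} {name} (base letter U+{ord(base):04X})")
```

A renderer that uses the initial shape for every letter, as described above, is in effect always picking the third row no matter where the letter sits in the word.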
The .ps file is an interesting case that fails badly with all the PS
renderers that I have. Download it to disk and run the command:
: head -1115 GIS_4T_D_W.utf8.ps | tail -20
You'll see the PS versions of the titles, and if your Mac or linux
box is like mine, the Chinese, Pinyin and Arabic titles will all be
correct, if not very aesthetic. This proves that the abc2ps
translator got the titles correct (and that the terminal emulator can
handle the UTF-8 encoding when it isn't explicitly labelled as such).
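If you'd rather not trust the terminal for this, a small script can at least confirm that the bytes are well-formed UTF-8. This is only a sketch: the function name utf8_report is made up for illustration, and passing the check shows the encoding is valid, not that the characters are the right ones.

```python
def utf8_report(data: bytes) -> dict:
    """Check whether raw bytes are valid UTF-8 and count non-ASCII characters."""
    try:
        text = data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that failed to decode.
        return {"valid_utf8": False, "error_offset": err.start, "non_ascii": 0}
    non_ascii = sum(1 for ch in text if ord(ch) > 127)
    return {"valid_utf8": True, "error_offset": None, "non_ascii": non_ascii}

# Small in-memory samples: "L" plus u-umlaut, first as UTF-8, then as Latin-1.
print(utf8_report("L\u00fc".encode("utf-8")))
print(utf8_report("L\u00fc".encode("iso-8859-1")))
```

To run it on the downloaded file, read it in binary mode, e.g. utf8_report(open("GIS_4T_D_W.utf8.ps", "rb").read()), assuming the file is in the current directory.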
The .pdf file was derived from the .ps file by feeding it to ps2pdf,
which comes with FreeBSD and linux; there's a pstopdf on Macs that
seems similar. I don't know how to verify that the .pdf file has the
titles correct. But try downloading either the .ps or .pdf file via a
browser. I just did it with Safari, which renders PDF inside a
window. The three non-English titles are all trashed. The Chinese and
Arabic are Latin-1 gibberish.
If you download them with firefox, they get fed to various renderers.
I've got them on my Mac's screen with both Preview and Acrobat, and
both show the same Latin-1 gibberish as does Safari. So they're all
making similar mistakes.
The evidence for what has gone wrong is in the Pinyin title, where
the second letter, u-umlaut, comes out as A1/4, with a tilde above
the A. So what they're apparently doing is interpreting the charset
as Latin-1 (ISO 8859-1). Why would this be? Well, the main way that
the charset gets determined for files downloaded via HTTP is from the
HTTP headers. Here are the headers that this machine's apache server
delivered to me:
HTTP/1.1 200 OK
Date: Sun, 19 Mar 2006 14:46:27 GMT
Server: Apache/1.3.34 (Unix)
Last-Modified: Sat, 11 Mar 2006 01:07:51 GMT
ETag: "321f4b-4f1-441222e7"
Accept-Ranges: bytes
Content-Length: 1265
Connection: close
Content-Type: text/vnd.abc; charset=utf-8
We can see here that the content is clearly labelled "charset=utf-8",
as we'd expect from the ".utf8." in the file name. So the browsers
know that the text is UTF-8. But the attempts to display the .ps
file, whether via Preview, Acrobat, or the Safari browser, all garble
the titles the same way.
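The Latin-1 theory is easy to check in a few lines of Python. The sketch below pulls the charset out of a Content-Type value like the one above, then decodes the UTF-8 bytes of a u-umlaut twice: once with the declared charset and once as ISO 8859-1. The helper name charset_from_content_type and the ISO 8859-1 fallback are assumptions made for illustration; this is not a claim about what Preview, Acrobat, or Safari actually do internally.

```python
from email.message import Message

def charset_from_content_type(value: str, default: str = "iso-8859-1") -> str:
    """Return the charset parameter of a Content-Type value, or a default."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns the lowercased charset parameter,
    # or failobj when the header carries no charset at all.
    return msg.get_content_charset(failobj=default)

raw = "\u00fc".encode("utf-8")  # u-umlaut as UTF-8: the two bytes C3 BC

declared = charset_from_content_type("text/vnd.abc; charset=utf-8")
right = raw.decode(declared)      # decoded as labelled: one character
wrong = raw.decode("iso-8859-1")  # decoded as Latin-1: two characters

print(declared, repr(right), repr(wrong))
```

The wrong decoding comes out as A-with-tilde followed by the one-quarter sign, which matches the garbled Pinyin title: each byte of the two-byte UTF-8 sequence gets treated as its own Latin-1 character.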
Anyway, I'm not too impressed by all this. It's not just that Apple
gets it so wrong; so do Adobe and Mozilla programs. What's really the
annoying part is that the Apple crowd just keeps chanting "It Just
Works", even when I post examples like this and try to get them to
explain why it fails so badly on our Powerbooks.
My wife has been using the Middle-East news as an excuse to improve
her Arabic, and she does things like reading Al Jazeera's Arabic pages.
There are also a lot of local blogs in Arabic, only a few of which
are also in English. Interesting stuff. But there's an ongoing
frustration with getting software to work right. She also has a
Windows box. It does some things right that her Mac messes up, but
the MS software also garbles some other stuff that is good on the
Mac. Both can be summarized as "not quite there and frustrating as
all hell".
One of my motives here is to extend this musical example. I'd like to
have music files like this that mix languages, not just in the
titles, but also in the lyrics. Getting even the simplest examples to
display right is an ongoing nightmare. Thus, I have a lot of songs in
Hebrew and Yiddish, which aren't quite the typographical nightmare
that Arabic is, but have similar problems. I used Arabic in the above
song title simply because it's known to be the worst case, so it's a
slightly better test case than Hebrew would have been.
BTW, Textedit seems to work fine on Macs with multiple languages,
other than the usual problem of getting the charset right at the
start. But copying from a Textedit window to other windows fails so
often that I'm often surprised when it works correctly.
It'd be interesting to see this example work on linux. I've been
considering testing Ubuntu, since it's aimed at exactly this sort of
multilingual user population. But it would be interesting to find
some good info on how to do such things right on any system. And Macs
are nice in some ways; it's too bad that the ad claims fall down so
badly.
--
_,
O John Chambers
<:#/> <jc at trillian.mit.edu>
+ <jc1742 at gmail.com>
/#\ in Waltham, Massachusetts, USA, Earth
| |