i18n

John Chambers jc at trillian.mit.edu
Sun Mar 19 11:25:38 EST 2006


#To: <discuss-bounces at blu.org> (Robert La Ferla)
|
| BTW - One of the most powerful features of Mac OS X (and its ancestor
| NextStep) is the NSText class (and related classes) in the
| ApplicationKit API.  The NSText class is what makes OS X applications
| easy to localize.  It supports East Asian, Arabic/Hebrew (right to left
| scripts), etc... natively.  It was a monster to program but the results
| are fantastic.  Nearly all OS X apps use it for everything from simple
| text fields (shared text object) or whole documents.  As a result, you
| can enter text in any language nearly anywhere in an app and copy/paste
| it to other apps, print it, etc...

Hmmm ...  You must have a very different instantiation of OS  X  than
what's on my wife's and my Powerbooks. Shelley and I have done a fair
bit of experimenting with text in languages with non-Roman  charsets,
and our experience is very different. I can hardly find any apps that
correctly implement copy/paste for anything but English text. Most of
them  produce  gibberish for non-Latin1 charsets at least part of the
time.  They can hardly even handle simple Cyrillic text sanely.

As a test case, I recently set up a stress-test example in my  online
music collection:

http://trillian.mit.edu/~jc/music/abc/China/GIS_4T_D_W.utf8.abc
http://trillian.mit.edu/~jc/music/abc/China/GIS_4T_D_W.utf8.txt
http://trillian.mit.edu/~jc/music/abc/China/GIS_4T_D_W.utf8.pdf
http://trillian.mit.edu/~jc/music/abc/China/GIS_4T_D_W.utf8.ps

The .abc and .txt files are the same file, with different MIME types.
If  you look at the .txt file in firefox, you'll see several versions
of the title, in the original Chinese, in  a  Pinyin  transliteration
(note  the  U-umlaut  in  the first word), in English, and in Arabic.
With firefox, all are displayed correctly.  I can also cat this  file
in a Mac Terminal window, and the titles render correctly.  I can
ssh to this FreeBSD system or to my linux box and cat the  file,  and
they  come out correct.  I can also ssh from an xterm on my linux box
to the other machines, cat the file, and it's fine.  This seems to
prove  that  Unicode  fonts  are  installed,  the  terminal emulators
understand UTF-8, and none of the software in the ssh+tcsh+cat chain
breaks anything.

With Safari, which is an Apple browser, the Arabic is rendered badly.
The  right-left  order  is ok, but the letters are all initial forms,
and they aren't connected. This isn't acceptable (in civilized Arabic
society  ;-).   Getting  the  letter  forms wrong makes the text very
difficult to read.
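
For anyone who hasn't met Arabic shaping: each letter is stored as a
single code point, and the renderer is supposed to pick an isolated,
initial, medial or final glyph from the neighbouring letters.  Here is
a minimal Python sketch that just shows the four forms exist for one
letter (BEH, picked arbitrarily by me), using the legacy "presentation
form" code points in Unicode.  It doesn't reproduce what Safari does
internally; it only illustrates what "all initial forms" means.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, as stored in ordinary text

# Legacy presentation-form code points for the same letter.
# A shaping engine normally chooses among these glyphs by context.
forms = {
    "isolated": "\ufe8f",
    "final": "\ufe90",
    "initial": "\ufe91",
    "medial": "\ufe92",
}

for label, ch in forms.items():
    # NFKC normalization folds a presentation form back to the base letter.
    base = unicodedata.normalize("NFKC", ch)
    print(f"{label:8} U+{ord(ch):04X} {unicodedata.name(ch)} -> U+{ord(base):04X}")

# True if every presentation form maps back to the one stored letter.
all_same_letter = all(
    unicodedata.normalize("NFKC", ch) == BEH for ch in forms.values()
)
```

A renderer that always picks the "initial" glyph, whatever the
neighbours are, gives the kind of disconnected text described above.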

The .ps file is an interesting case that fails badly with all the  PS
renderers that I have.  Download it to disk and run the command:

: head -1115 GIS_4T_D_W.utf8.ps | tail -20

You'll see the PS versions of the titles, and if your  Mac  or  linux
box  is  like mine, the Chinese, Pinyin and Arabic titles will all be
correct,  if  not  very  aesthetic.   This  proves  that  the  abc2ps
translator got the titles correct (and that the terminal emulator can
handle the UTF-8 encoding when it isn't explicitly labelled as such).
The  .pdf file was derived from the .ps file by feeding it to ps2pdf,
which comes with FreeBSD and linux; there's a pstopdf on Macs that
seems similar.  I don't know how to verify that the .pdf file has the
titles correct. But try downloading either the .ps or .pdf file via a
browser.   I  just  did  it  with  Safari, which renders PDF inside a
window. The three non-English titles are all trashed. The Chinese and
Arabic are Latin-1 gibberish.

If you download them with firefox, they get fed to various renderers.
I've  got  them on my Mac's screen with both Preview and Acrobat, and
both show the same Latin-1 gibberish as does Safari.  So they're  all
making similar mistakes.

The evidence for what has gone wrong is in the  Pinyin  title,  where
the  second  letter,  u-umlaut, comes out as A1/4, with a tilde above
the A.  So what they're apparently doing is interpreting the  charset
as Latin-1 (ISO 8859-1).  Why would this be?  Well, the main way that
the charset of a file downloaded via HTTP gets identified is from the
HTTP headers.  Here are the headers that this machine's apache server
delivered to me:

HTTP/1.1 200 OK
Date: Sun, 19 Mar 2006 14:46:27 GMT
Server: Apache/1.3.34 (Unix)
Last-Modified: Sat, 11 Mar 2006 01:07:51 GMT
ETag: "321f4b-4f1-441222e7"
Accept-Ranges: bytes
Content-Length: 1265
Connection: close
Content-Type: text/vnd.abc; charset=utf-8

We can see here that the content is clearly labelled "charset=utf-8",
as  we'd  expect from the ".utf8." in the file name.  So the browsers
know that the text is UTF-8.  But every attempt to display the .ps
file, whether via Preview, Acrobat, or the Safari browser, garbles
the titles the same way.
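
The u-umlaut symptom is easy to reproduce.  Here is a minimal Python
sketch that pulls the charset out of a Content-Type value like the one
above and compares decoding with it against decoding as Latin-1.  The
helper name charset_from_content_type and its ISO 8859-1 fallback are
my own assumptions for illustration; I'm not claiming this is what
Preview or Acrobat actually do inside.

```python
from email.message import Message


def charset_from_content_type(header_value, default="iso-8859-1"):
    """Return the charset parameter of a Content-Type value.

    Hypothetical helper: falls back to `default` when the header
    carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


u_umlaut = "\u00fc"             # the second letter of the Pinyin title
raw = u_umlaut.encode("utf-8")  # the two bytes actually sent: C3 BC

# Decoding with the charset the server declared gives the letter back.
declared = charset_from_content_type("text/vnd.abc; charset=utf-8")
right = raw.decode(declared)

# Ignoring the label and assuming Latin-1 turns each byte into its own
# character: A-tilde followed by the one-quarter sign.
wrong = raw.decode("iso-8859-1")

print(declared, repr(right), repr(wrong))
```

The two-character result in the wrong case is the "A1/4 with a tilde"
pattern, which is why that one letter points so clearly at a
UTF-8-read-as-Latin-1 mistake.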

Anyway, I'm not too impressed by all this.  It's not just that  Apple
gets it so wrong; so do Adobe and Mozilla programs. What's really the
annoying part is that the Apple crowd just keeps  chanting  "It  Just
Works",  even  when  I post examples like this and try to get them to
explain why it fails so badly on our Powerbooks.

My wife has been using the Middle-East news as an excuse  to  improve
her Arabic, and she does things like reading Al Jazeera's Arabic pages.
There are also a lot of local blogs in Arabic, only a  few  of  which
are  also  in  English.   Interesting  stuff.  But there's an ongoing
frustration with getting software to work  right.   She  also  has  a
Windows  box.   It does some things right that her Mac messes up, but
the MS software also garbles some other stuff that  is  good  on  the
Mac.   Both  can be summarized as "not quite there and frustrating as
all hell".

One of my motives here is to extend this musical example. I'd like to
have  music  files  like  this  that  mix  languages, not just in the
titles, but also in the lyrics. Getting even the simplest examples to
display right is an ongoing nightmare. Thus, I have a lot of songs in
Hebrew and Yiddish, which aren't quite  the  typographical  nightmare
that Arabic is, but have similar problems. I used Arabic in the above
song title simply because it's known to be the worst case, so it's  a
slightly better test case than Hebrew would have been.

BTW, Textedit seems to work fine on Macs with multiple languages,
other than the usual problem of getting the charset right at the
start.  But copying from a Textedit window to other windows fails  so
often that I'm often surprised when it works correctly.

It'd be interesting to see this example work  on  linux.   I've  been
considering  testing Ubuntu, since it's aimed at exactly this sort of
multilingual user population.  But it would be  interesting  to  find
some good info on how to do such things right on any system. And Macs
are nice in some ways; it's too bad that the ad claims fall  down  so
badly.



--
   _,
   O   John Chambers
 <:#/> <jc at trillian.mit.edu>
   +   <jc1742 at gmail.com>
  /#\  in Waltham, Massachusetts, USA, Earth
  | |


