RE: Unicode source file -- JAVA400-L

>Hmm.. is the AS/400 "big endian"?  I don't even remember
>which way PCs are now, Big or Little Endian.

>In case you didn't know, Endian refers to the orders of the
>bytes for a word.  Does the "big" byte come first or the
>"little" byte.


OS/400 is Big Endian.  Linux on iSeries is Big Endian.  Motorola and Sun are
Big Endian (high order or "big" byte first).

Alpha and the Intel x86 family are little endian; MIPS (IIRC) can be
configured either way.  Linux on Intel and all Microsoft OSes are Little
Endian.
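
To make the two layouts concrete, here is a minimal sketch using
java.nio.ByteBuffer (available since JDK 1.4).  The class and method names
are just illustrative, not anything from the original question.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical demo class: shows how one 32-bit value is laid out in memory
// under each byte order.
final class ByteOrderDemo {
    // Lay out a 32-bit value in the requested byte order and return the bytes.
    static byte[] layout(int value, ByteOrder order) {
        return ByteBuffer.allocate(4).order(order).putInt(value).array();
    }

    // Format bytes as space-separated two-digit hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Big endian: high order byte first -> 0A 0B 0C 0D
        System.out.println(hex(layout(0x0A0B0C0D, ByteOrder.BIG_ENDIAN)));
        // Little endian: low order byte first -> 0D 0C 0B 0A
        System.out.println(hex(layout(0x0A0B0C0D, ByteOrder.LITTLE_ENDIAN)));
        // What the local CPU actually uses (varies by machine).
        System.out.println(ByteOrder.nativeOrder());
    }
}
```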

Note carefully that Java cleverly straddles all this.  Internally, it is
implemented as whatever endian the local CPU is.  By disallowing casts that
would overlay a larger type on a smaller one (or the reverse), and by
defining JDBC and I/O carefully, it appears to the programmer to be Big
Endian.
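
A small sketch of what "appears to be Big Endian" means in practice:
java.io.DataOutputStream writes the high order byte first no matter which CPU
runs the code.  The wrapper class below is hypothetical; only the JDK calls
are real.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper: captures the bytes DataOutputStream emits for one int.
final class JavaIoOrder {
    static byte[] writeInt(int value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(value); // defined to write the high byte first
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // Same result on Intel and on iSeries: 1 2 3 4
        for (byte b : writeInt(0x01020304)) {
            System.out.print(b + " ");
        }
        System.out.println();
    }
}
```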

That includes the handful of interfaces for reading Unicode streams.  If
the underlying data is actually little endian Unicode (see next paragraph),
you'll have to reverse the bytes yourself or name a little endian encoding
such as "UTF-16LE" when you construct the reader.
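
Reversing the bytes by hand might look roughly like the sketch below.  It
assumes the input is 16-bit Unicode with an even number of bytes and no
leading byte order mark; the class name is made up for illustration.  On a
JDK that has java.nio.charset.StandardCharsets (Java 7 and later), decoding
directly with StandardCharsets.UTF_16LE should give the same string without
the manual swap.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: turns little endian 16-bit Unicode into a String by
// swapping each byte pair and then decoding as big endian.
final class Utf16Swap {
    // Swap each pair of bytes.  Assumes an even length; a trailing odd byte
    // is not copied.
    static byte[] swapPairs(byte[] in) {
        byte[] out = new byte[in.length];
        for (int i = 0; i + 1 < in.length; i += 2) {
            out[i] = in[i + 1];
            out[i + 1] = in[i];
        }
        return out;
    }

    static String decodeLittleEndian(byte[] in) {
        return new String(swapPairs(in), StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        byte[] fromPc = {0x48, 0x00, 0x69, 0x00}; // "Hi" as written on a PC
        System.out.println(decodeLittleEndian(fromPc)); // Hi
    }
}
```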

Unicode is definitely a problem.  It was supposed to be a Big Endian
standard until Microsoft balked.  Now you can be either way, just like most
of the rest of the world.  There is an optional "throwaway" character
(0xFEFF, the byte order mark) which reveals the intended endian of the rest
of the data stream (if you load 0xFFFE you know you got it backwards).
Almost no one I know
of actually includes it.  Since the example shows a "save as" with Unicode
and Unicode Big Endian, one presumes that the former is Little Endian.
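
Checking for the mark is cheap when it is present.  In terms of raw bytes in
the file, FE FF at the front means big endian and FF FE means little endian.
A minimal sketch follows; the class name and the choice to return null when
there is no mark are my own assumptions.

```java
// Hypothetical helper: looks at the first two bytes of a 16-bit Unicode
// stream for a byte order mark (U+FEFF).
final class BomSniffer {
    // Returns "UTF-16BE", "UTF-16LE", or null when no mark is present.
    static String detect(byte[] data) {
        if (data.length >= 2) {
            int b0 = data[0] & 0xFF;
            int b1 = data[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                return "UTF-16BE"; // mark stored high byte first
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                return "UTF-16LE"; // mark stored low byte first
            }
        }
        return null; // no mark; the caller has to guess some other way
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xFF, (byte) 0xFE, 0x48, 0x00};
        System.out.println(detect(sample)); // UTF-16LE
    }
}
```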

In most cases, since the first 256 code points tend to be frequently used,
it is usually possible, by inspection, to tell what "endian" a Unicode file
uses if you don't know ahead of time.  If it came from a PC, it is almost
certainly little endian Unicode.
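
The "by inspection" test can be automated, at least roughly.  For text drawn
mostly from the first 256 code points, one byte of each pair is zero, so
counting where the zeros fall gives a reasonable guess.  This is only a
heuristic sketch with made-up names; it will not help with text that is
mostly outside that range.

```java
// Hypothetical heuristic: guess the byte order of 16-bit Unicode text by
// counting zero bytes at even and odd offsets.
final class EndianGuess {
    // Returns "UTF-16BE", "UTF-16LE", or null when the counts are tied.
    static String guess(byte[] data) {
        int zeroEven = 0;
        int zeroOdd = 0;
        for (int i = 0; i + 1 < data.length; i += 2) {
            if (data[i] == 0) {
                zeroEven++; // zero in the first byte of the pair
            }
            if (data[i + 1] == 0) {
                zeroOdd++;  // zero in the second byte of the pair
            }
        }
        if (zeroEven > zeroOdd) {
            return "UTF-16BE"; // high (zero) byte comes first
        }
        if (zeroOdd > zeroEven) {
            return "UTF-16LE"; // high (zero) byte comes second
        }
        return null;
    }
}
```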

Generally speaking, UTF-8 (a byte encoded version of Unicode and also a
choice given before) is much easier to deal with.  UTF-8 will look almost
like ASCII with some strange two or three byte sequences here and there.
UTF-8 will be handled much more adroitly at the end of the day, because you
don't have to worry about Java versus C on an Intel box -- both languages
can handle the translation readily, because it isn't byte-order dependent.
UTF-8 is 100 per cent interchangeable with Unicode, because it is just a
different encoding of it.
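
A quick sketch of both points: plain ASCII comes out unchanged, anything else
becomes a short multi-byte sequence, and the round trip loses nothing.  The
class name is invented for the example, and StandardCharsets assumes Java 7
or later.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: UTF-8 encoding and decoding with no byte order involved.
final class Utf8Demo {
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static String decode(byte[] b) {
        return new String(b, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 'A' stays a single byte, 0x41, exactly as in ASCII.
        System.out.println(encode("A").length);      // 1
        // The Euro sign (U+20AC) becomes three bytes: E2 82 AC.
        System.out.println(encode("\u20AC").length); // 3
        // Round trip gives back the original string.
        String text = "caf\u00E9 \u20AC";
        System.out.println(decode(encode(text)).equals(text)); // true
    }
}
```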

Finally, since this represents a choice the user actually has, is there any
Unicode in the stream to begin with?  If the data actually is ISO 8859-1
(identical to the first 256 code points of Unicode), then see if there is
an option to store them as ordinary files.  The main practical snag to this
nowadays would be the Euro character, which ISO 8859-1 does not include, if
one is in the US or Western Europe.
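
One way to find out whether the data really fits in ISO 8859-1 is to ask the
encoder.  A minimal sketch, with a hypothetical class name and assuming
Java 7 or later for StandardCharsets:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical check: can every character in the text be stored as
// ISO 8859-1?
final class Latin1Check {
    static boolean fitsLatin1(String s) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        System.out.println(fitsLatin1("caf\u00E9"));       // true
        System.out.println(fitsLatin1("price: \u20AC10")); // false, Euro sign
    }
}
```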


Larry W. Loen  -   Senior Linux, Java, and iSeries Performance Analyst