


> From: jt
> 
> Up to 4 bytes?  Did NOT know that.  You mean to tell me there are
> practical uses for a Codepage or CCSID that encompasses > 64K
> characters??  Just wondering what that would be, as (at least used ta)
> view DBCS as sufficient.

It's not that simple, JT.  It gets VERY complicated, but the semi-short
version is that Unicode supports up to 1.1 million code points.  The
easiest representation of Unicode is UCS format, in which the value of
the code point is mapped pretty much directly into an integer.  However,
since either byte in a UCS-2 character can be any value, nothing marks
where a character starts.  If you lose a single byte in a UCS transfer,
every later byte pair is misaligned and the entire rest of the stream
can be lost.
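
To make the misalignment concrete, here is a minimal Python sketch.  It
assumes big-endian UCS-2 and a made-up helper, decode_ucs2, that simply
pairs bytes two at a time; neither comes from any particular system.

```python
# Sketch: lose one byte from a big-endian UCS-2 stream and decode the rest.
# decode_ucs2 is a hypothetical helper written only for this illustration.

def decode_ucs2(data: bytes) -> str:
    """Pair bytes two at a time (big-endian), ignoring a trailing odd byte."""
    usable = len(data) - (len(data) % 2)
    return "".join(
        chr((data[i] << 8) | data[i + 1]) for i in range(0, usable, 2)
    )

text = "Hello, world"
stream = text.encode("utf-16-be")   # for BMP-only text this matches UCS-2
damaged = stream[:3] + stream[4:]   # drop the low byte of "e"

intact = decode_ucs2(stream)        # decodes back to the original text
garbled = decode_ucs2(damaged)      # "H" survives, everything after is wrong
```

After the dropped byte, each remaining low byte gets paired with the
following high byte, so none of the later characters come out as sent.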

UTF-8 was designed to transport UCS data from one machine to another
with the ability to resync after dropping a byte.  In order to do
that, some fancy bit-packing is done where the first byte of a character
is either <128 (this is a single-byte character, allowing 7-bit ASCII to
be pure passthrough) or >=192 (the lead byte of a multi-byte character).
Anything from 128 through 191 is a non-leading byte of a multi-byte
UTF-8 encoding.  With this encoding, a UTF-8 encoding of an ASCII text
file is half the size of the same file in UCS-2.
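
A rough Python sketch of those byte ranges and of the resync idea.  The
function names classify and resync are invented for this example, and
the sample strings are arbitrary.

```python
# Sketch: classify UTF-8 bytes by range, then resync a stream that was
# picked up in the middle of a multi-byte character.

def classify(byte: int) -> str:
    """Name the role a single UTF-8 byte plays, based on its value."""
    if byte < 0x80:
        return "single"         # 0..127: ASCII, passes through unchanged
    if byte < 0xC0:
        return "continuation"   # 128..191: never starts a character
    return "lead"               # 192 and up: starts a multi-byte character

def resync(data: bytes) -> bytes:
    """Skip leading continuation bytes so decoding starts on a boundary."""
    start = 0
    while start < len(data) and classify(data[start]) == "continuation":
        start += 1
    return data[start:]

text = "na\u00efve caf\u00e9"
encoded = text.encode("utf-8")      # 6E 61 C3 AF 76 65 20 63 61 66 C3 A9
late_start = encoded[3:]            # stream picked up at the AF byte
recovered = resync(late_start).decode("utf-8")

# Size comparison for pure ASCII text.
ascii_text = "plain ASCII text"
utf8_size = len(ascii_text.encode("utf-8"))
ucs2_size = len(ascii_text.encode("utf-16-be"))
```

Only the character that lost its lead byte is sacrificed; everything
after it decodes normally.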

To summarize: In UTF-8 coding, all Unicode values from 0-65535 (this is
called the Basic Multilingual Plane, which includes most ideographs as
well as traditional Latin languages) can be encoded in one, two or three
bytes.  Special codes outside the BMP (primarily uncommon ideographs)
require four bytes.  (Technically, UTF-8 can transform up to UCS-4
encoding, which worst case needs up to six bytes per character, although
the larger conversions wouldn't be necessary unless the Unicode standard
was significantly expanded.)
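
The one-to-four byte cases can be sketched by hand in Python.  The
function utf8_encode below is a hypothetical illustration, not a
library routine; it assumes a valid code point no larger than 0x10FFFF
and does no checking of its own.

```python
# Sketch: hand-rolled UTF-8 bit-packing for code points up to 0x10FFFF.
# The caller is assumed to pass a valid code point (no validation here).

def utf8_encode(cp: int) -> bytes:
    if cp < 0x80:                   # 7 bits: one byte, same as ASCII
        return bytes([cp])
    if cp < 0x800:                  # 11 bits: lead byte + 1 continuation
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                # 16 bits (rest of the BMP): 3 bytes
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([                  # outside the BMP: 4 bytes
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])

samples = [0x41, 0xE9, 0x4E2D, 0x20000]   # Latin, accented, CJK, non-BMP
lengths = [len(utf8_encode(cp)) for cp in samples]
```

Each result can be checked against Python's own encoder with
chr(cp).encode("utf-8").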

There is another encoding known as UTF-16.  This is the native format
for Java String objects.  This encoding gets REALLY bizarre REALLY
quick, because there are issues of byte order.
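
A small Python sketch of the byte order issue, using only the standard
codecs.  The choice of "A" as the sample character is arbitrary.

```python
# Sketch: the same character in both UTF-16 byte orders, and what happens
# when the receiver guesses the order wrong.
import codecs

ch = "A"                                # U+0041
big = ch.encode("utf-16-be")            # bytes 00 41
little = ch.encode("utf-16-le")         # bytes 41 00

# Reading big-endian bytes as little-endian yields a different character.
misread = big.decode("utf-16-le")       # U+4100, not "A"

# A byte order mark at the front tells the decoder which order was used.
with_bom = codecs.BOM_UTF16_BE + big
decoded = with_bom.decode("utf-16")     # BOM is consumed, leaving "A"
```

Without a byte order mark or some outside agreement, the receiver has
no reliable way to tell the two orders apart.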

For a little more on the encoding/decoding (although from a Mac
standpoint), read here:

http://www1.tip.nl/~t876506/utf8tbl.html

Joe



This mailing list archive is Copyright 1997-2024 by midrange.com and David Gibbs as a compilation work. Use of the archive is restricted to research of a business or technical nature. Any other uses are prohibited. Full details are available on our policy page. If you have questions about this, please contact [javascript protected email address].
