Re: Please consider voting for my RFE -- MIDRANGE-L

"MIDRANGE-L" <midrange-l-bounces@xxxxxxxxxxxx> wrote on 10/12/2016
01:46:06 PM:

From: John Yeung <gallium.arsenide@xxxxxxxxx>
To: Midrange Systems Technical Discussion <midrange-l@xxxxxxxxxxxx>
Date: 10/12/2016 01:46 PM
Subject: Re: Please consider voting for my RFE
Sent by: "MIDRANGE-L" <midrange-l-bounces@xxxxxxxxxxxx>

On Wed, Oct 12, 2016 at 2:27 PM, Dan <dan27649@xxxxxxxxx> wrote:

3) what's up with VARGRAPHIC(50) CCSID(1200)???

VARGRAPHIC is a kind of archaic way of saying multibyte character:

<http://stackoverflow.com/questions/13192911/what-is-the-reason-for-the-name-vargraphic>

CCSID 1200 refers to big-endian UTF-16.

In other words, the text can be 50 Unicode *characters*, in this case

Actually, it just means that 50 2-byte code units can be stored (100 bytes
total).

encoded as UTF-16 BE, so two bytes per character. (Other Unicode
encodings, such as UTF-8, can have varying numbers of bytes per
character, including more than 2 bytes per character.)

John Y.

Actually, all UTF encodings are technically multi-byte character sets.
Even UTF-16 requires a surrogate pair (two 2-byte code units) to encode
any Unicode code point outside of the Basic Multilingual Plane (U+10000
and up).
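
For anyone who wants to check the arithmetic, here is a minimal Python
sketch that counts UTF-16 code units. The helper names and the limit of 50
are my own, chosen to mirror VARGRAPHIC(50); this only illustrates the
counting and is not how Db2 itself enforces the length.

```python
def utf16_code_units(text):
    """Return the number of 2-byte UTF-16 code units needed for text."""
    # Encoding as big-endian UTF-16 (no BOM) gives 2 bytes per code unit.
    return len(text.encode("utf-16-be")) // 2


def fits_in_code_units(text, max_units=50):
    """Hypothetical check: does text fit in max_units UTF-16 code units?"""
    return utf16_code_units(text) <= max_units


bmp_text = "A" * 50            # 50 BMP characters, 1 code unit each
winks_25 = "\U0001F609" * 25   # 25 non-BMP characters, 2 code units each
winks_26 = "\U0001F609" * 26

print(utf16_code_units(bmp_text), fits_in_code_units(bmp_text))   # 50 True
print(utf16_code_units(winks_25), fits_in_code_units(winks_25))   # 50 True
print(utf16_code_units(winks_26), fits_in_code_units(winks_26))   # 52 False
```

So 50 code units hold 50 BMP code points, but only 25 code points from
outside the BMP.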

While UTF-32 can encode all 1+ million Unicode code points in one code unit
(4 bytes), you have to be careful not to conflate a Unicode code point
with a "character." A character (in the abstract sense) may be made up of
multiple Unicode code points, which may further be encoded in multiple
code units. You can see a list of such characters here:
http://unicode.org/Public/UNIDATA/NamedSequences.txt

e.g. LATIN SMALL LETTER I WITH OGONEK AND DOT ABOVE AND TILDE U+012F
U+0307 U+0303

When encoded in UTF-8, it would be encoded in the following bytes:

X'C4AFCC87CC83' (1 character, made of 3 Unicode code points, each encoded
in 2 1-byte code units, totalling 6 bytes)

When encoded in UTF-16, it would be the following bytes:

X'012F03070303' (1 character, made of 3 Unicode code points, each encoded
in 1 2-byte big-endian code unit, totalling 6 bytes)
X'2F0107030303' (1 character, made of 3 Unicode code points, each encoded
in 1 2-byte little-endian code unit, totalling 6 bytes)

And finally in UTF-32:

X'0000012F0000030700000303' (1 character, made of 3 Unicode code points,
each encoded in 1 4-byte big-endian code unit, totalling 12 bytes)
X'2F0100000703000003030000' (1 character, made of 3 Unicode code points,
each encoded in 1 4-byte little-endian code unit, totalling 12 bytes)
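
The byte strings for this named sequence can be reproduced with Python's
built-in codecs. This is just a quick sketch for checking the hex; the
function name encodings_hex is my own.

```python
# LATIN SMALL LETTER I WITH OGONEK AND DOT ABOVE AND TILDE
SEQUENCE = "\u012F\u0307\u0303"

CODECS = ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")


def encodings_hex(text):
    """Map each codec name to the uppercase hex of text encoded with it."""
    return {codec: text.encode(codec).hex().upper() for codec in CODECS}


print(len(SEQUENCE))  # 3 code points, even though it is 1 "character"
for codec, hex_bytes in encodings_hex(SEQUENCE).items():
    print(f"{codec:10} X'{hex_bytes}'")
```

The explicit -be and -le codec names are used so that no byte order mark is
prepended to the output.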

e.g. WINKING FACE U+1F609

When encoded in UTF-8, it would be encoded in the following bytes:

X'F09F9889' (1 character, made of 1 Unicode code point, encoded in 4
1-byte code units, totalling 4 bytes)

When encoded in UTF-16, it would be the following bytes:

X'D83DDE09' (1 character, made of 1 Unicode code point, encoded in 2
2-byte big-endian code units, totalling 4 bytes)
X'3DD809DE' (1 character, made of 1 Unicode code point, encoded in 2
2-byte little-endian code units, totalling 4 bytes)

And finally in UTF-32:

X'0001F609' (1 character, made of 1 Unicode code point, encoded in 1
4-byte big-endian code unit, totalling 4 bytes)
X'09F60100' (1 character, made of 1 Unicode code point, encoded in 1
4-byte little-endian code unit, totalling 4 bytes)
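
To see where the D83D DE09 pair comes from, here is a small Python sketch
of the standard UTF-16 surrogate calculation. The function name
surrogate_pair is my own; the result is checked against Python's built-in
codec.

```python
def surrogate_pair(code_point):
    """Split a code point above the BMP into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = surrogate_pair(0x1F609)      # WINKING FACE
print(f"{high:04X} {low:04X}")           # D83D DE09

# Cross-check against the built-in big-endian UTF-16 codec.
print("\U0001F609".encode("utf-16-be").hex().upper())  # D83DDE09
```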

More reading:
https://en.wikipedia.org/wiki/Surrogate_pair
https://en.wikipedia.org/wiki/Combining_character
https://en.wikipedia.org/wiki/Han_unification

Probably way more than you or anyone on this list wanted to know, but hey
"The More You Know" https://gfycat.com/ImmenseFearlessCockatiel