On Wed, Feb 14, 2018 at 11:10 AM, John Yeung <gallium.arsenide@xxxxxxxxx> wrote:
Text, or characters, are the things that have meaning to human beings.
Text is *encoded* into bytes, or decoded *from* bytes. Proper decoding
of bytes depends on knowing which encoding was used to produce them.
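To make that concrete, here is a quick Python illustration (Python only
because its str/bytes split happens to mirror this model; the point is
not the language): the very same byte means different things depending
on which encoding you assume produced it.

    data = bytes([0xE9])            # one raw byte: 0xE9
    print(data.decode('latin-1'))   # 'é'  if it came from Latin-1
    print(data.decode('cp037'))     # 'Z'  if it came from EBCDIC CCSID 37
    # data.decode('utf-8') would raise UnicodeDecodeError, because
    # a lone 0xE9 is an incomplete multi-byte sequence in UTF-8

Same byte, three different outcomes, and only knowledge of the source
encoding tells you which one is right.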
Incidentally, I think one of the prime sources of confusion is
thinking only about encodings and giving short shrift to the human
concept of characters. When you have two files with different
encodings, many people are prone to think "I have to convert (or
translate) one encoding into the other".
In my experience, that framing leads to confusion, mistakes, fumbling
around with settings, and transformations applied wantonly with no real
understanding.
I believe it is much clearer (and I'm not talking down to anyone here;
this is what it took for *me* to get out of the fumbling-around state)
to not even mentally entertain the notion of "encoding -> encoding".
There can ONLY be "human text -> bytes" or "bytes -> human text". If
you need to transform bytes in one encoding to bytes in another
encoding, it MUST be a two-step process which involves first "(source)
bytes -> human text" and then "human text -> (target) bytes".
Yes, of course in a program, everything at some level has to be bytes.
But you need to keep straight *conceptually* what is human text. If
you have to, imagine a very special encoding which is reserved for
"internal use" and doesn't have anything to do with any CCSID or code
page. This is what Unicode code points are all about. Unicode is NOT
meant to be a byte-level encoding. It really is meant to be "just
numbers" which serve as the proxy for human text, such that the ONLY
two operations you have are "Unicode code points -> encoded bytes"
(a.k.a. "encoding") and "encoded bytes -> Unicode code points" (a.k.a.
"decoding").
John Y.