


On Wed, Feb 14, 2018 at 11:10 AM, John Yeung <gallium.arsenide@xxxxxxxxx> wrote:
Text, or characters, are the things that have meaning to human beings.
Text is *encoded* into bytes, or decoded *from* bytes. Proper decoding
of bytes depends on knowing which encoding was used to produce them.
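To make that concrete, here is a small Python sketch (my choice of
language and of sample bytes, not something from the original
discussion). The five bytes below spell "Hello" when decoded as cp037,
one of the EBCDIC code pages. Decoding the very same bytes as Latin-1
also "works" in the sense that no error is raised, but what comes out
is not the text anyone meant.

```python
# The same five bytes, interpreted under two different encodings.
data = b"\xc8\x85\x93\x93\x96"

as_ebcdic = data.decode("cp037")    # the encoding that produced them
as_latin1 = data.decode("latin-1")  # a wrong guess; no error is raised

print(repr(as_ebcdic))  # the intended text
print(repr(as_latin1))  # one accented letter followed by control characters
```

Nothing in the bytes themselves says which reading is right; that
knowledge has to come from outside the data.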

Incidentally, I think one of the prime sources of confusion is
thinking only about encodings and giving short shrift to the human
concept of characters. When you have two files with different
encodings, many people are prone to think "I have to convert (or
translate) one encoding into the other".

In my experience, that mindset leads to confusion, mistakes, fumbling
around with settings, and wantonly applying transformations with no
real understanding.

I believe it is much clearer (and I'm not talking down to anyone here;
this is what it took for *me* to get out of the fumbling-around state)
to not even mentally entertain the notion of "encoding -> encoding".
There can ONLY be "human text -> bytes" or "bytes -> human text". If
you need to transform bytes in one encoding to bytes in another
encoding, it MUST be a two-step process which involves first "(source)
bytes -> human text" and then "human text -> (target) bytes".
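As a minimal sketch of that two-step process in Python (the function
name `transcode` and the sample string are just illustrations I made
up; cp037 and UTF-8 stand in for whatever source and target encodings
you actually have):

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes by going through text, never bytes -> bytes."""
    text = data.decode(source)   # step 1: (source) bytes -> human text
    return text.encode(target)   # step 2: human text -> (target) bytes


# Sample input: the word "cafe" with an e-acute, as EBCDIC (cp037) bytes.
ebcdic_bytes = "caf\u00e9".encode("cp037")

utf8_bytes = transcode(ebcdic_bytes, "cp037", "utf-8")
print(utf8_bytes)
```

The intermediate `text` value is the important part. If step 1 uses
the wrong source encoding, you can inspect `text`, see that it is not
what a human would expect, and stop before writing bad bytes anywhere.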

Yes, of course in a program, everything at some level has to be bytes.
But you need to keep straight *conceptually* what is human text. If
you have to, imagine a very special encoding which is reserved for
"internal use" and doesn't have anything to do with any CCSID or code
page. This is what Unicode code points are all about. Unicode is NOT
meant to be a byte-level encoding. It really is meant to be "just
numbers" which serve as the proxy for human text, such that the ONLY
two operations you have are "Unicode code points -> encoded bytes"
(a.k.a. "encoding") and "encoded bytes -> Unicode code points" (a.k.a.
"decoding").
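Python's `str` type happens to follow this model closely, so it makes
a convenient illustration (again, a sketch of my own, with encodings
picked only as examples). One character has one code point, and that
code point turns into different bytes depending on the encoding you
choose:

```python
text = "\u00e9"  # a single character: e with an acute accent

# Text viewed as "just numbers": one code point per character.
code_points = [ord(ch) for ch in text]

# Code points -> encoded bytes ("encoding"), under three encodings.
encoded = {name: text.encode(name) for name in ("utf-8", "latin-1", "cp037")}

# Encoded bytes -> code points ("decoding"), each with its own encoding.
decoded = {name: raw.decode(name) for name, raw in encoded.items()}

print([hex(cp) for cp in code_points])
for name, raw in encoded.items():
    print(name, raw, decoded[name] == text)
```

The bytes differ from one encoding to the next, but every decode
lands back on the same code point, which is the whole point of
treating it as the proxy for human text.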

John Y.



This mailing list archive is Copyright 1997-2024 by midrange.com and David Gibbs as a compilation work. Use of the archive is restricted to research of a business or technical nature. Any other uses are prohibited. Full details are available on our policy page. If you have questions about this, please contact [javascript protected email address].
