Re: ACS Run SQL Scripts saves files as UTF-8 BOM -- MIDRANGE-L

Scott,

I didn't take a position on the BOM. In fact I took pains to state,
before every point that I made, that I was not taking any position or
making any recommendation. I was just pointing out and clarifying
facts.

On Fri, Mar 17, 2017 at 3:20 AM, Scott Klement
<midrange-l@xxxxxxxxxxxxxxxx> wrote:

> The Unicode "standard" says that the BOM should only be included in UTF-8
> when necessary. (I'm paraphrasing... but it's something to that effect.)

The exact wording is "neither required nor recommended". Personally
(yes, my own opinion here), I think the wording was designed to be as
neutral and as precise as possible. I believe the phrase is sometimes
read too strongly, as an explicit recommendation against the BOM.

> You are saying that it's not necessary here because Notepad++ can figure it
> out? The trouble with that is that Notepad++ has to read THE WHOLE FILE and
> analyze every single character. Even then it can't be 100% certain, it
> assumes that if something would fit a valid UTF-8 character it therefore
> must be UTF-8 rather than ASCII used with control characters...

Here's the thing: As long as you are receiving data from someone else,
there is never 100% certainty. Period. That's an unfortunate but hard
fact.

In a way, the CCSID system might actually be even more prone to
mistakes, because it's easy to change a file's purported CCSID without
changing its contents. It's also possible to prepend the UTF-8 BOM on
data that isn't really UTF-8 (and people have reported encountering
this), but this strikes me as harder to do and less commonly occurring
in practice.

Anyway, some people have thus concluded that if you want a higher
degree of certainty, you have to sniff the contents regardless. All I
was saying is that NetServer or anything else *could* be made to do
what Notepad++ does; that it's not "impossible" to detect UTF-8
without the BOM.
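
To make the "sniffing" idea concrete, here is a minimal Python sketch of
one way to test whether a byte string is well-formed UTF-8 without relying
on a BOM. The function name looks_like_utf8 is hypothetical, and I am not
claiming this is how Notepad++ or NetServer actually do it; it only shows
that such a check is possible.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (pure ASCII qualifies too)."""
    try:
        # A strict decode rejects any byte sequence that is not valid UTF-8.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(looks_like_utf8("café".encode("utf-8")))    # well-formed UTF-8
    print(looks_like_utf8("café".encode("latin-1")))  # lone 0xE9 is not valid UTF-8
    print(looks_like_utf8(b"plain ASCII"))            # ASCII is a subset of UTF-8
```

As noted above, passing this check is strong evidence, not proof: some
non-UTF-8 data can still happen to be well-formed UTF-8.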

> I agree that Notepad++ does a nice job of this... but don't you think it'd
> be a LOT nicer if applications could determine this by just reading the
> first 3 bytes? Not to mention much more performant? And that's what the
> BOM does in this case.

Well, I disagree that the BOM makes much difference in performance. At
least, I don't think it *should* make much difference in performance.
Why do I say that? Because the BOM isn't a full CCSID system. It's
just an indicator that Unicode follows. If the actual encoding of the
data isn't Unicode, then there is no BOM, and of course no shortcut is
available. But what if it is Unicode? Well, like you say, if the first
three bytes are the UTF-8 BOM, then you can read the rest as UTF-8
(with high enough confidence that you don't bother to do further
checking). But if there's no BOM, you could *also* just read in the
data as UTF-8! No separate pass for checking. True, if the data
happens not to be UTF-8, then you will have to go back and try another
encoding or give up with an error. But that's the same as if you had
picked a non-Unicode encoding at the start, and it turns out to have
been UTF-8 after all.
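
Here is a rough Python sketch of the "just read it as UTF-8 and fall back
if that fails" approach. The function name decode_text and the choice of
cp1252 as the fallback encoding are assumptions for illustration only; a
real application would pick its own fallback (or give up with an error).

```python
import codecs


def decode_text(data: bytes, fallback: str = "cp1252"):
    """Decode data as UTF-8 if possible, else with a fallback encoding.

    Returns a (text, encoding_used) tuple.
    """
    # If a UTF-8 BOM is present, drop it before decoding.
    has_bom = data.startswith(codecs.BOM_UTF8)
    body = data[len(codecs.BOM_UTF8):] if has_bom else data
    try:
        # Optimistic single pass: decoding validates as it goes, so there
        # is no separate checking pass, with or without a BOM.
        return body.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not UTF-8 after all (or a BOM stuck on non-UTF-8 data): go back
        # and try the assumed fallback encoding on the original bytes.
        return data.decode(fallback, errors="replace"), fallback


if __name__ == "__main__":
    print(decode_text(codecs.BOM_UTF8 + "café".encode("utf-8")))
    print(decode_text("café".encode("utf-8")))
    print(decode_text("café".encode("cp1252")))
```

In the first two calls the data is decoded in one pass; only the third
pays for a second attempt.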

So I would say the BOM provides a meaningful and reliable performance
improvement only *if* the Unicode standard changes to *require* a BOM.
Otherwise the improvement is hit or miss depending on the data you
actually receive and on your first guess.

Incidentally, there is an IETF memo (so, about as official as it gets
for all things Internet), that expressly recommends omitting the BOM
in contexts where UTF-8 is mandated (as opposed to contexts allowing a
mixture of encodings) and also omitting the BOM where there is some
other means of identifying the encoding (such as metadata; CCSID would
fall under this category). This is RFC 3629, in case anyone wants to
look it up.

> [The BOM for UTF-8] makes applications (by major vendors like
> Microsoft whose software I cannot change) work properly, and makes things
> much more clear and efficient even for those that can detect UTF-8 properly
> without it.
>
> Just my take on it, of course.

While I am less convinced of the efficiency, my own take is that the
BOM for UTF-8 does make things clearer, and I do appreciate that. I
also appreciate that sometimes, you have to throw the BOM in there
just to make things work (because the other guy's software
*incorrectly* rejects UTF-8 without BOM).
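
For anyone who does need to add (or strip) the BOM to satisfy a particular
consumer, here is a small Python sketch. The helper names add_bom and
strip_bom are hypothetical; codecs.BOM_UTF8 and the "utf-8-sig" codec are
part of the standard library.

```python
import codecs


def add_bom(data: bytes) -> bytes:
    """Prepend the UTF-8 BOM unless the data already starts with one."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data


def strip_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


if __name__ == "__main__":
    sql = "SELECT 1 FROM SYSIBM.SYSDUMMY1"
    plain = sql.encode("utf-8")        # no BOM
    signed = sql.encode("utf-8-sig")   # the codec writes the BOM itself
    print(add_bom(plain) == signed)
    print(strip_bom(signed) == plain)
    # Decoding with "utf-8-sig" accepts input with or without the BOM.
    print(signed.decode("utf-8-sig") == plain.decode("utf-8-sig") == sql)
```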

John Y.