Hi Scott,
By the standards, all XML documents must begin with the < character.
There's no other character a well-formed XML document can begin with.
How does the BOM character fit into all of this?
http://en.wikipedia.org/wiki/Byte-order_mark
http://www.opentag.com/xfaq_enc.htm#enc_bom
Aaron Bartell
http://mowyourlawn.com
On Tue, Jul 8, 2008 at 1:21 PM, Scott Klement <rpg400-l@xxxxxxxxxxxxxxxx> wrote:
Chuck,
Remember, XML is a format designed for data interchange. When you copy
a file via a tool like Windows Networking (or /QNTC, etc) or when you
copy a file via a tool like FTP, SCP, SFTP, RCP, etc, etc... how can
the encoding possibly be known if it's not part of the document?
If I do business with 100000 different suppliers, and each one needs to
send me an XML document, surely it's not reasonable to expect that every
one of them will use a proprietary i5/OS FTP extension to set the CCSID
properly when they transfer the XML to me? Especially since each of
those 100000 vendors probably has several thousand other customers all
running different systems. Is it reasonable for them to have to do
something different for everyone's system to set the proper
encoding/CCSID/codepage or whatever?
No. The data needs to be self-describing. Folks need to be able to
send the XML data and not worry about doing some extra steps to notify
each system which encoding the data is in.
I can understand you being curious as to how it works, but I don't
understand why you'd find such a practical feature to be "laughable".
By the standards, all XML documents must begin with the < character.
There's no other character a well-formed XML document can begin with.
Given that, it's really pretty easy to determine if the document is
big-endian or little-endian. Does it start with x'003c'? Then it's
big-endian. Does it start with x'3c00'? Then it's little-endian. In all
ASCII and Unicode encodings, < is always x'3c', x'3c00' or x'003c'. In
EBCDIC it'd be x'4C' (or x'004C'), though I don't think EBCDIC is
technically supported by the XML standard.
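As a rough illustration of the check described above, here is a minimal Python sketch. It assumes the document has no BOM in front of the < and covers only the byte patterns mentioned in this message; the function name guess_byte_order and the labels it returns are made up for the example.

```python
def guess_byte_order(data: bytes) -> str:
    """Guess the encoding family from the leading '<' of an XML document.

    Hypothetical helper: assumes no BOM precedes the '<' character.
    """
    head = data[:2]
    if head == b"\x00\x3c":
        return "utf-16-be"          # x'003c': 16-bit units, big-endian
    if head == b"\x3c\x00":
        return "utf-16-le"          # x'3c00': 16-bit units, little-endian
    if head[:1] == b"\x3c":
        return "ascii-compatible"   # x'3c': ASCII, UTF-8, ISO-8859-x, ...
    if head[:1] == b"\x4c":
        return "ebcdic"             # x'4C': '<' in EBCDIC
    return "unknown"
```

The two-byte patterns are tested before the single-byte x'3c' case, because a little-endian document also starts with x'3c'.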
Since all of the characters in the <?xml encoding="whatever"?> are
invariant, simply reading the first two bytes of the document should be
enough to let you read the "encoding" information (or to determine that
it's not there, in which case you take the default).
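To make that concrete, here is a hedged Python sketch of reading the declared encoding once the first two bytes have been examined. The names declared_encoding and _DECL are invented for the example, cp037 is used as a stand-in for "some EBCDIC code page", and the fallback is UTF-8, which is what the XML specification prescribes when there is no declaration and no BOM. It is a sketch of the idea, not a full implementation of the standard's detection rules.

```python
import re

# Matches the encoding pseudo-attribute inside a leading XML declaration.
_DECL = re.compile(
    r'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def declared_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in the XML declaration, or the default."""
    head = data[:200]  # the declaration sits at the very start
    if head[:2] == b"\x00\x3c":
        text = head.decode("utf-16-be", errors="replace")
    elif head[:2] == b"\x3c\x00":
        text = head.decode("utf-16-le", errors="replace")
    elif head[:1] == b"\x4c":
        # Assumption: treat any EBCDIC input as code page 037.
        text = head.decode("cp037", errors="replace")
    else:
        # latin-1 maps every byte to a character, so this never fails.
        text = head.decode("latin-1")
    match = _DECL.match(text)
    return match.group(1) if match else default
```

Decoding only the first couple of hundred bytes is enough here, because the characters used in the declaration are the same in every code page of a given family.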
Not sure why this is laughable.
CRPence wrote:
Without regard to any concerns for how something does or does not
work, well or appropriately, I offer in response to the quoted snippet...
I have always thought it laughable, that character data could ever be
considered self-describing of its encoding. With all the various [and
possibly any new] encoding schemes, including endian issues, how
could any stream of bits in any particular encoding, actually define the
encoding of the complete stream of bits? Just as with transporting
those bits, what the encoding is, must be negotiated _outside_ of the
data itself.
This is the RPG programming on the AS400 / iSeries (RPG400-L) mailing list