Chuck,
Remember, XML is a format designed for data interchange. When you
copy a file via a tool like Windows Networking (or /QNTC, etc.) or
via a tool like FTP, SCP, SFTP, RCP, etc., how can the encoding
possibly be known if it's not part of the document?
If I do business with 100000 different suppliers, and each one needs
to send me an XML document, surely it's not reasonable to expect that
every one of them will use a proprietary i5/OS FTP extension to set
the CCSID properly when they transfer the XML to me? Especially since
each of those 100000 vendors probably has several thousand other
customers all running different systems. Is it reasonable for them to
have to do something different for everyone's system to set the
proper encoding/CCSID/codepage or whatever?
No. The data needs to be self-describing. Folks need to be able to
send the XML data without taking extra steps to tell each receiving
system which encoding the data is in.
I can understand you being curious as to how it works, but I don't understand why you'd find such a practical feature to be "laughable".
By the standard, the XML declaration (if there is one) must be the
very first thing in the document, so apart from an optional byte-order
mark, a document that declares its encoding always begins with the <
character.
Given that, it's really pretty easy to determine if the document is
big-endian or little-endian. Does it start with x'003c'? Then it's
big-endian. Does it start with x'3c00'? Then it's little-endian. In
ASCII-compatible encodings and UTF-8, < is always x'3c'; in UTF-16
it's x'003c' or x'3c00' (UTF-32 pads that out to four bytes). In
EBCDIC it'd be x'4C', and although XML processors are only required to
support UTF-8 and UTF-16, the spec's appendix on encoding autodetection
does cover the EBCDIC case.
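As a rough sketch of the first-bytes check described above (Python; the
function name sniff_family is hypothetical, and it only handles the
byte patterns mentioned here, ignoring byte-order marks and UTF-32,
which a real parser would also have to check):

```python
def sniff_family(head: bytes) -> str:
    """Guess the encoding family from how the leading '<' is encoded.

    Simplified on purpose: no BOM handling and no UTF-32, so a
    UTF-32 little-endian document would be misreported as UTF-16.
    """
    if head.startswith(b"\x00\x3c"):
        return "utf-16-be"          # x'003c': big-endian
    if head.startswith(b"\x3c\x00"):
        return "utf-16-le"          # x'3c00': little-endian
    if head.startswith(b"\x3c"):
        return "ascii-compatible"   # ASCII, UTF-8, ISO-8859-x, ...
    if head.startswith(b"\x4c"):
        return "ebcdic"             # '<' in EBCDIC code pages
    return "unknown"
```

For example, sniff_family("<?xml".encode("utf-16-le")) reports
"utf-16-le" without any out-of-band information about the file.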
Since all of the characters in the <?xml version="1.0"
encoding="whatever"?> declaration are invariant, simply reading the
first few bytes of the document should be enough to tell you which
family of encodings you're dealing with, and that lets you read the
"encoding" information (or determine that it's not there, in which
case you take the default, UTF-8).
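To illustrate, here is a minimal sketch of reading the declaration once
the leading bytes hint at the encoding family. The function name
declared_encoding, the 200-byte window, and the choice of cp037 as the
EBCDIC code page are all assumptions for the example, not anything the
XML standard prescribes.

```python
import re

# Matches the encoding pseudo-attribute inside a leading XML declaration.
_DECL = re.compile(
    r"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)

def declared_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in the XML declaration, or the default."""
    # Pick a provisional codec from how the leading '<' is encoded.
    if data.startswith(b"\x00\x3c"):
        provisional = "utf-16-be"
    elif data.startswith(b"\x3c\x00"):
        provisional = "utf-16-le"
    elif data.startswith(b"\x4c"):
        provisional = "cp037"      # assumption: one common EBCDIC page
    else:
        provisional = "latin-1"    # decodes any byte, ASCII range intact
    # The declaration uses only invariant characters, so the provisional
    # codec is good enough to read it even if the body uses another one.
    head = data[:200].decode(provisional, errors="replace")
    match = _DECL.match(head)
    return match.group(1) if match else default
```

The caller would then decode the whole document with the returned name
(or fall back to the default when no declaration is present).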
Not sure why this is laughable.
CRPence wrote:
Without regard to any concerns for how something does or does not work, well or appropriately, I offer in response to the quoted snippet...
I have always thought it laughable that character data could ever be considered self-describing of its encoding. With all the various [and possibly any new] encoding schemes, including endian issues, how could any stream of bits in any particular encoding actually define the encoding of the complete stream of bits? Just as with transporting those bits, what the encoding is must be negotiated _outside_ of the data itself.