Re: XML* vs. eXpat - encoding -- RPG400-L

Chuck,

Remember, XML is a format designed for data interchange. When you copy a file with Windows Networking (or /QNTC, etc.), or transfer it with FTP, SCP, SFTP, RCP, and so on, how can the encoding possibly be known if it's not part of the document?

If I do business with 100,000 different suppliers, and each one needs to send me an XML document, surely it's not reasonable to expect every one of them to use a proprietary i5/OS FTP extension to set the CCSID properly when they transfer the XML to me? Especially since each of those 100,000 vendors probably has several thousand other customers, all running different systems. Is it reasonable for them to have to do something different for everyone's system to set the proper encoding/CCSID/codepage or whatever?

No. The data needs to be self-describing. Folks need to be able to send the XML data and not worry about doing some extra steps to notify each system which encoding the data is in.

I can understand you being curious as to how it works, but I don't understand why you'd find such a practical feature to be "laughable".

By the standard, every well-formed XML document must begin with the < character (optionally preceded by a byte order mark). There's no other character a well-formed XML document can begin with. Given that, it's really pretty easy to determine if the document is big-endian or little-endian. Does it start with x'003c'? Then it's big-endian. Does it start with x'3c00'? Then it's little-endian. In the ASCII-based and UTF-16 encodings, < is always x'3c', x'3c00' or x'003c'. In EBCDIC it'd be x'4C' (or x'004C'). The XML standard only requires parsers to support UTF-8 and UTF-16, though it doesn't forbid other encodings such as EBCDIC.
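To make that check concrete, here is a minimal sketch in Python. The function name sniff_family and the labels it returns are made up for illustration, and it deliberately ignores byte order marks and UTF-32 (whose little-endian form it would mistake for UTF-16), so treat it as a picture of the idea rather than a complete detector.

```python
def sniff_family(head: bytes) -> str:
    """Guess the encoding family from the first bytes of an XML document.

    Hypothetical helper: assumes the document starts directly with '<'
    (no byte order mark) and does not try to recognise UTF-32.
    """
    if head.startswith(b"\x00\x3c"):
        return "utf-16-be"          # x'003c': 16-bit units, big-endian
    if head.startswith(b"\x3c\x00"):
        return "utf-16-le"          # x'3c00': 16-bit units, little-endian
    if head.startswith(b"\x3c"):
        return "ascii-compatible"   # x'3c': ASCII, UTF-8, ISO-8859-x, ...
    if head.startswith(b"\x4c"):
        return "ebcdic"             # x'4C': '<' in EBCDIC
    return "unknown"
```

For example, sniff_family(open("order.xml", "rb").read(4)) would classify a file by its first few bytes; the file name is only a placeholder.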

Since all of the characters in the <?xml version="1.0" encoding="whatever"?> declaration are invariant, simply reading the first two bytes of the document should be enough to tell you how to read the "encoding" information (or to determine that it's not there, in which case you take the default).
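A rough sketch of that second step, again in Python: use the first two bytes to pick a provisional decoding, read just the declaration, and pull out the encoding name. The function name declared_encoding is hypothetical, cp037 is only one assumed EBCDIC code page, and the fallback of UTF-8 is the default the XML standard gives for a document with no declaration and no byte order mark. A real parser such as eXpat does considerably more than this.

```python
import re

# Matches the encoding pseudo-attribute inside a leading <?xml ...?> declaration.
_DECL = re.compile(
    r"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)

def declared_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in the XML declaration, or the default.

    Hypothetical helper: ignores byte order marks and UTF-32.
    """
    head = data[:200]  # the declaration sits at the very start
    if head.startswith(b"\x00\x3c"):
        text = head.decode("utf-16-be", errors="replace")
    elif head.startswith(b"\x3c\x00"):
        text = head.decode("utf-16-le", errors="replace")
    elif head.startswith(b"\x4c"):
        # Assumption: cp037 stands in for "some EBCDIC code page".
        text = head.decode("cp037", errors="replace")
    else:
        # The declaration uses only characters that ASCII-based
        # encodings share, so ASCII is enough to read it.
        text = head.decode("ascii", errors="replace")
    match = _DECL.match(text)
    return match.group(1) if match else default
```

The name it returns is what you would then use to decode the rest of the document.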

Not sure why this is laughable.

CRPence wrote:

Without regard to any concerns for how something does or does not work, well or appropriately, I offer in response to the quoted snippet...

I have always thought it laughable, that character data could ever be considered self-describing of its encoding. With all the various [and possibly any new] encoding schemes, including endian issues, how could any stream of bits in any particular encoding, actually define the encoding of the complete stream of bits? Just as with transporting those bits, what the encoding is, must be negotiated _outside_ of the data itself.
