Re: XML* vs. eXpat -- RPG400-L

Michael Ryan wrote:

Looking for some opinions here...I have a complex (to me) XML document
that I'll need to parse and extract the data. I've used Scott's port
of eXpat before, and that's worked well for me. Finally on V5R4, so I
thought about using the XML* opcodes. Any opinions on using one over
the other? Are there cases where one approach works better?

My thoughts:

XML-INTO:

XML-INTO is the simplest-to-use XML parser I've seen in any programming language. (Obviously, "simple" is in the eye of the beholder, so this is just my opinion.)

XML-INTO is very easy to use -- for simple things. But, as others have pointed out, it's limited.

a) Its support for namespaces is not-so-good. Since namespaces are HEAVILY used in web services, this is problematic for me.

b) I don't like the way XML-INTO (and also XML-SAX, IIRC) treats character set encodings. It assumes that the CCSID on the outside of a file (or the CCSID of your job in the case of an alphanumeric string) is the CCSID of the XML data inside the file.

XML has its own mechanism for specifying its encoding. There's a tag that reads something like this:
<?xml version="1.0" encoding="utf-8"?>

A program is supposed to read that tag to determine the encoding -- but the RPG parsers ignore it completely. This is problematic because some folks (many folks!) get their XML data transferred from Windows, Unix, etc., and the XML data is sent with a tool like FTP, Windows Networking, SSH or (most commonly) HTTP. Most of these file transfer methods have no knowledge of CCSIDs and do not ensure that the CCSID on the i5 is set to the right encoding. Your customers can't be expected to make sure the data is encoded a special way! That's one of the big advantages of XML: The data is self-describing. So to handle XML properly, you'd have to write a program that reads the opening <?xml tag and determines the right encoding for you. Seems cumbersome to me!
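To give a rough idea of what "reading the opening <?xml tag" involves, here is a minimal sketch. It's in Python rather than RPG, only so it can be run anywhere. The function name sniff_xml_encoding and the fallback to utf-8 are my own choices for illustration, and it assumes the first bytes of the document are ASCII-compatible (so it would not cope with UTF-16 or EBCDIC data as written).

```python
import re

# Matches the encoding pseudo-attribute inside an XML declaration.
_DECL = re.compile(
    rb"^<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']"
)

def sniff_xml_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in the XML declaration, or `default`."""
    # Skip a UTF-8 byte order mark if one is present.
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]
    # The declaration, if any, must be at the very start of the document.
    match = _DECL.match(data[:200])
    if match:
        return match.group(1).decode("ascii").lower()
    return default
```

Once you know the encoding name, you'd still have to map it to a CCSID and convert (or tag) the data before handing it to the parser.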

But, for sure, they work. They run very fast, and have the big advantage that you don't have to install any additional software on your (or your client's) computer to use them.

XML-SAX:

If you use both this one and Expat, you'll see that they are extremely similar. XML-SAX has more event types than Expat does, and I find Expat easier for that reason. That's also why XML-SAX is a true SAX parser, whereas Expat can only say it's a "stream parser, similar to a SAX parser". I prefer the way Expat handles events because I find it simpler -- but then again, I might just be set in my ways, having used Expat for a long time already.

XML-SAX also suffers from the character set translation issue I mentioned for XML-INTO. (That is, if I recall correctly).

Another issue with XML-SAX and XML-INTO is the length of strings in V5R4 and earlier. In those releases, strings are limited to 64k. So if you receive the XML data as a parameter, and it's longer than 64k, you have to write the parameter to a stream file, then have XML-INTO/XML-SAX read the stream file. I find that a little cumbersome.

Having said that -- I find both XML-INTO and XML-SAX to be worthwhile tools. They're not perfect, and they have the flaws I mentioned, but they're very fast, built into the language, fully supported by IBM, etc.

EXPAT:

Expat is one of the oldest XML parsers extant today. It was developed by James Clark, who was, at the time, the technical lead of the XML Working Group at W3C as they were developing the XML standard. The Expat engine is the underlying parsing engine in some XML projects you may have heard of: The Apache HTTP Server, PHP, OpenOffice, Perl, Python and Mozilla. There are lots of other things as well, but those are the best known ones.

Both a blessing and a curse is the fact that Expat doesn't know how to read files. You have to write a routine to read the file (or whatever medium you're getting your XML data from) and feed it into the parser. It might be from a stream file, a variable, data queue, socket, user space, MQ Series, even (cough) a physical file. (But, please don't use a physical file!) You can even read your data directly from a pointer or a user space, thus easily circumventing the V5R4 size limits. Since you write it, anything that lets you load bytes into memory can be a medium for Expat. While this provides flexibility, it also means that you have to write/test/debug more code... (a minor thing for me, since I rarely make mistakes when writing code that reads a file -- but perhaps it's a bigger deal to you.)
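Here's a minimal sketch of that "you read, Expat parses" arrangement. It uses Python's standard xml.parsers.expat module (a binding to the same Expat engine) rather than the RPG port, so the calls won't look identical to what you'd code in RPG. The helper name parse_in_chunks and the tiny chunk size are just for illustration.

```python
import io
import xml.parsers.expat

def parse_in_chunks(read_chunk, on_start, chunk_size=4096):
    """Feed Expat from any callable that returns bytes (b'' at end of data)."""
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    while True:
        chunk = read_chunk(chunk_size)
        if not chunk:
            parser.Parse(b"", True)   # tell Expat the document is complete
            break
        parser.Parse(chunk, False)    # more data will follow

# Usage: an in-memory buffer stands in for a stream file, socket, etc.
names = []
source = io.BytesIO(b"<order><item sku='A1'/><item sku='B2'/></order>")
parse_in_chunks(source.read, lambda name, attrs: names.append(name), chunk_size=8)
```

The parser never sees where the bytes came from. Swap source.read for anything else that hands back the next block of data, and the rest stays the same.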

Expat tends to be a bit more complicated in some other ways. Handling null-terminated strings (a standard mechanism in Expat) can be cumbersome.

Dealing with XML attributes can be very cumbersome in Expat, as it requires quite a bit of work with pointers -- more than one person has run away screaming when I've shown them how to code the pointer logic to read the attributes of an XML document.
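For anyone wondering why pointers come into it: in Expat's C interface, the start-element handler receives the attributes as a flat, NULL-terminated array of string pointers that alternates name, value, name, value. The sketch below shows the same layout through Python's xml.parsers.expat module, where setting ordered_attributes makes the parser report that flat list. The function name collect_attributes is made up for this example, and in RPG you'd be walking pointers instead of indexing a list.

```python
import xml.parsers.expat

def collect_attributes(xml_bytes):
    """Return (element, {attr: value}) pairs built from Expat's flat list."""
    found = []

    def on_start(name, flat):
        # flat looks like [name1, value1, name2, value2, ...]
        pairs = {flat[i]: flat[i + 1] for i in range(0, len(flat), 2)}
        found.append((name, pairs))

    parser = xml.parsers.expat.ParserCreate()
    parser.ordered_attributes = True   # report attributes as a flat list
    parser.StartElementHandler = on_start
    parser.Parse(xml_bytes, True)
    return found
```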

But, it doesn't have the CCSID problem that some of the other choices have. Expat has the ability to determine the encoding from the <?xml encoding="XXX"?> directive. But, that ability isn't mandatory. If you want to override it to a different encoding, you can.
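A small sketch of both behaviors, again using Python's xml.parsers.expat binding rather than the RPG port. The helper name element_text and the sample documents are made up for illustration. Passing an encoding name when creating the parser overrides whatever the document says (or doesn't say) about itself.

```python
import xml.parsers.expat

def element_text(xml_bytes, override_encoding=None):
    """Parse xml_bytes and return all character data as one string."""
    pieces = []
    # None means "let Expat work out the encoding from the document".
    parser = xml.parsers.expat.ParserCreate(override_encoding)
    parser.CharacterDataHandler = pieces.append
    parser.Parse(xml_bytes, True)
    return "".join(pieces)

# Same ISO-8859-1 bytes, once with a declaration and once without.
declared = b'<?xml version="1.0" encoding="iso-8859-1"?><a>caf\xe9</a>'
undeclared = b"<a>caf\xe9</a>"

from_declaration = element_text(declared)            # encoding read from the document
from_override = element_text(undeclared, "iso-8859-1")  # encoding forced by the caller
```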

Plus, of course, unlike the XML-SAX and XML-INTO choices, Expat requires additional software to be downloaded and installed.

All three options lack the ability to validate a document against a schema, as well as the ability to write XML documents. This isn't important to me personally, but many people have told me it's a priority for them. So, that's another consideration.

Just my opinion.