Michael Ryan wrote:
Looking for some opinions here...I have a complex (to me) XML document
that I'll need to parse and extract the data. I've used Scott's port
of eXpat before, and that's worked well for me. Finally on V5R4, so I
thought about using the XML* opcodes. Any opinions on using one over
the other? Are there cases where one approach works better?
My thoughts:
XML-INTO:
XML-INTO is the simplest to use XML parser I've seen in any programming
language. (Obviously, "simple" is in the eye of the beholder, so this
is just my opinion).
XML-INTO is very easy to use -- for simple things. But, as others have
pointed out, it's limited.
a) Its support for namespaces is not-so-good. Since namespaces are
HEAVILY used in web services, this is problematic for me.
b) I don't like the way XML-INTO (and also XML-SAX, IIRC) treat
character set encodings. They assume that the CCSID on the outside of a
file (or the CCSID of your job in the case of an alphanumeric string) is
the CCSID of the XML data inside the file.
XML has its own mechanism for specifying its encoding. There's a tag
that reads something like this:
<?xml version="1.0" encoding="utf-8"?>
A program is supposed to read that tag to determine the encoding -- but
the RPG parsers ignore it completely. This is problematic because some
folks (many folks!) get their XML data transferred from Windows, Unix,
etc, and the XML data is sent with a tool like FTP, Windows Networking,
SSH or (most commonly) HTTP. Most of these file transfer methods have no
knowledge of CCSIDs and do not ensure that the CCSID on the i5 is set to
the right encoding. Your customers can't be expected to make sure the
data is encoded a special way! That's one of the big advantages of XML:
The data is self-describing. So to handle XML properly, you'd have to
write a program that reads the opening <?xml tag and determines the right
encoding for you. Seems cumbersome to me!
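
To give an idea of what that extra step involves, here is a minimal sketch in Python (not RPG) of reading the opening <?xml tag and pulling out the encoding name. The function name sniff_xml_encoding and the 200-byte look-ahead are my own choices for illustration. It assumes the start of the document is in an ASCII-compatible encoding, so EBCDIC or UTF-16 input would need extra checks before this would work.

```python
import re

# Matches the encoding pseudo-attribute inside a leading XML declaration.
_DECL_ENCODING = re.compile(
    rb'^<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(raw: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in the XML declaration, or `default`."""
    # A UTF-8 byte order mark may precede the declaration; skip it.
    if raw.startswith(b"\xef\xbb\xbf"):
        raw = raw[3:]
    # Only the first bytes matter, since the declaration must come first.
    match = _DECL_ENCODING.match(raw[:200])
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()
```

The result could then be mapped to a CCSID before the data is handed to the parser. That mapping is not shown here.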
But, for sure, they work. They run very fast, and have the big
advantage that you don't have to install any additional software on your
(or your client's) computer to use them.
XML-SAX:
If you use both this one and Expat, you'll see that they are extremely
similar. XML-SAX has more distinct event types than Expat does, which is
why XML-SAX is a true SAX parser, whereas Expat can only say it's a
"stream parser, similar to a SAX parser". Personally, I find the way
Expat handles events simpler for that same reason -- but then again, I
might just be set in my ways, having used Expat for a long time already.
XML-SAX also suffers from the character set translation issue I
mentioned for XML-INTO. (That is, if I recall correctly).
Another issue with XML-SAX and XML-INTO is the length of strings in V5R4
and earlier. In those releases, strings are limited to 64k. So if you
receive the XML data as a parameter, and it's longer than 64k, you have
to write the parameter to a stream file, then have XML-INTO/XML-SAX read
the stream file. I find that a little cumbersome.
Having said that -- I find both XML-INTO and XML-SAX to be worthwhile
tools. Not perfect, they have flaws as I mentioned, but they're very
fast, built into the language, fully supported by IBM, etc.
EXPAT:
Expat is one of the oldest XML parsers extant today. It was developed
by James Clark, who was, at the time, the technical lead of the XML
Working Group at W3C as they were developing the XML standard. The
Expat engine is the underlying parsing engine in some XML projects you
may have heard of: The Apache HTTP Server, PHP, OpenOffice, Perl,
Python and Mozilla. There are lots of other things as well, but those
are the best known ones.
Both a blessing and a curse is the fact that Expat doesn't know how to
read files. You have to write a routine to read the file (or whatever
medium you're getting your XML data from) and feed it into the parser.
It might be from a stream file, a variable, data queue, socket, user
space, MQ Series, even (cough) a physical file. (But, please don't use
a physical file!) You can even read your data directly from a pointer
or a user space, thus easily circumventing the V5R4 size limits. Since
you write it, anything that lets you load bytes into memory can be a
medium for Expat. While this provides flexibility, it also means that
you have to write/test/debug more code... (a minor thing for me, since
I rarely make mistakes when writing code that reads a file -- but
perhaps it's a bigger deal to you.)
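
As a rough illustration of the "you feed the parser" model, here is a sketch using Python's standard binding to Expat (xml.parsers.expat) rather than the RPG port. The function name parse_in_chunks and the chunk size are made up for the example. The source can be anything with a read() method, which stands in for the stream file, socket, or user space mentioned above.

```python
import io
import xml.parsers.expat


def parse_in_chunks(source, chunk_size=4096):
    """Read `source` piece by piece and feed each piece to Expat.

    Returns the element names in the order their start tags were seen.
    """
    names = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            parser.Parse(b"", True)   # tell Expat the input is finished
            break
        parser.Parse(chunk, False)    # more data may follow
    return names


if __name__ == "__main__":
    data = io.BytesIO(b"<order><item/><item/></order>")
    print(parse_in_chunks(data, chunk_size=5))
```

Because the parser only ever sees one chunk at a time, the total document size is not bounded by the size of any single variable.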
Expat tends to be a bit more complicated in some other ways. Handling
null-terminated strings (a standard mechanism in Expat) can be cumbersome.
Dealing with XML attributes can be very cumbersome in Expat, as it
requires quite a bit of work with pointers -- more than one person has
run away screaming when I've shown them how to code the pointer logic to
read the attributes of an XML document.
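
In the C API, the start-element handler receives the attributes as a flat array of pointers that alternates name, value, name, value, and ends with a null pointer. The sketch below uses Python's Expat binding with ordered_attributes turned on, which reports attributes in the same alternating layout (as a list, with no pointer arithmetic needed). It's only meant to show the shape of the data; the function name collect_attributes is hypothetical, and the RPG code looks quite different.

```python
import xml.parsers.expat


def collect_attributes(xml_bytes):
    """Return (element, attribute, value) tuples in document order."""
    found = []

    def on_start(name, attrs):
        # With ordered_attributes on, attrs is a flat list:
        # [name1, value1, name2, value2, ...]
        for i in range(0, len(attrs), 2):
            found.append((name, attrs[i], attrs[i + 1]))

    parser = xml.parsers.expat.ParserCreate()
    parser.ordered_attributes = True
    parser.StartElementHandler = on_start
    parser.Parse(xml_bytes, True)
    return found


if __name__ == "__main__":
    print(collect_attributes(b'<item sku="A1" qty="3"/>'))
```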
But, it doesn't have the CCSID problem that some of the other choices
have. Expat has the ability to determine the encoding from the <?xml
encoding="XXX"?> directive. But that ability isn't mandatory: if you
want to override it with a different encoding, you can.
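
Both behaviors can be seen in a small sketch using Python's Expat binding, where the optional argument to ParserCreate is the override. The function name element_text and the sample documents are made up for the example. The RPG port is called differently, so treat this as an illustration of the idea only.

```python
import xml.parsers.expat


def element_text(xml_bytes, encoding=None):
    """Parse `xml_bytes` and return all character data as one string.

    With encoding=None, Expat works the encoding out from the document.
    A non-None value overrides whatever the document declares.
    """
    pieces = []
    parser = xml.parsers.expat.ParserCreate(encoding)
    # Character data may arrive in several calls, so collect and join.
    parser.CharacterDataHandler = pieces.append
    parser.Parse(xml_bytes, True)
    return "".join(pieces)


if __name__ == "__main__":
    # The declaration is correct, so no override is needed.
    declared = b'<?xml version="1.0" encoding="ISO-8859-1"?><n>caf\xe9</n>'
    print(element_text(declared))

    # The declaration says UTF-8 but the bytes are really ISO-8859-1.
    mislabeled = b'<?xml version="1.0" encoding="UTF-8"?><n>caf\xe9</n>'
    print(element_text(mislabeled, "ISO-8859-1"))
```

Without the override, the second document should be rejected, since the byte 0xE9 followed by "<" is not valid UTF-8.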
Plus, of course, unlike the XML-SAX and XML-INTO choices, Expat requires
additional software to be downloaded and installed.
All three options lack the ability to validate a document against a
schema, as well as the ability to write XML documents. This isn't
important to me personally, but many people have told me it's a priority
for them. So, that's another consideration.
Just my opinion.