We are parsing XML sent up to the IBM i from our field associates. I'm
working on how to handle emojis that are part of the UTF-8 content of
the XML.
I want to be able to auto-correct the XML in the parsing program as much
as I can. That is, put a space where there is some non-convertible
content, such as a happy face emoji.
With much good help from the group, I changed the XML-SAX operation to
use the option ccsid=ucs2 - this was very helpful and got us past
problems with characters that EBCDIC does not have, such as ellipses and
em dashes.
Now I have code that will take the reported length when an
*XML_EXCEPTION event occurs - and I am working under the assumption that
this is the __number of bytes__ to the error in the XML file.
The documentation says that the "...value of the string-length parameter
is the length of the document that was parsed up to and including the
point where the exception occurred."
There is also the RNX0351 MSGID, which says the following: "...parser
detected an error at offset &3..." - and every time I've got the 351
status, that offset is 1 less than the "string-length" value - makes
sense, since an offset is 1 less than a position.
OK - so I have a problem - I have XML that contains a horizontal
ellipsis (the 3 periods thing) and an emoji (a 4-byte string for a pair
of eyes). The ellipsis is represented with 3 bytes for the 1 character.
The ellipsis passes fine - it gets converted to UCS-2 from UTF-8 with no
problem.
The emoji triggers the exception, and turns on the 351 status.
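A likely explanation for the difference, sketched here in Python purely
for illustration (the real program is RPG): UCS-2 can only represent
code points up to U+FFFF, and the emoji lies beyond that range. The
helper name `fits_in_ucs2` is hypothetical, and this says nothing about
how the IBM parser is actually implemented.

```python
def fits_in_ucs2(ch: str) -> bool:
    """Return True if the single character lies at or below U+FFFF."""
    return ord(ch) <= 0xFFFF


# The two byte strings discussed in this post, decoded from UTF-8.
ellipsis = b"\xe2\x80\xa6".decode("utf-8")   # horizontal ellipsis, 3 bytes
eyes = b"\xf0\x9f\x91\x80".decode("utf-8")   # EYES emoji, 4 bytes

for name, ch in (("ellipsis", ellipsis), ("eyes", eyes)):
    # Show the code point, the UTF-8 byte length, and the UCS-2 check.
    print(name, f"U+{ord(ch):04X}", len(ch.encode("utf-8")), fits_in_ucs2(ch))
```

Any character that needs 4 bytes in UTF-8 is above U+FFFF, so the same
check would flag other emojis too.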
The "position" parameter is 379 in this particular XML. The bytes at
that position are x'6120F09F' - and this is odd, because x'61' and x'20'
are valid UTF-8 characters (lower case a and a space).
The actual offending bytes are x'F09F9180', which is the EYES emoji. But
that string is 2 bytes farther along in the XML file. It starts at
position 381 (bytes) of the XML file, and that is what I want to replace
with a space.
I therefore cannot use the "error position" value to replace the 4-byte
content _at_ that position with a space, as I'd like to do, then
recursively call the procedure that does the XML-SAX operation.
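The replacement step itself could look something like the following
Python sketch (illustration only, not RPG). It assumes a correct 1-based
byte position is already known, which is exactly what the parser does
not supply here. `utf8_sequence_length` and `replace_sequence_with_space`
are hypothetical names.

```python
def utf8_sequence_length(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with this lead byte."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    raise ValueError(f"not a UTF-8 lead byte: {lead:#04x}")


def replace_sequence_with_space(data: bytes, byte_pos: int) -> bytes:
    """Replace the UTF-8 sequence at 1-based byte_pos with one ASCII space."""
    start = byte_pos - 1
    length = utf8_sequence_length(data[start])
    return data[:start] + b" " + data[start + length:]


# Small synthetic sample: "a ", the EYES emoji, then " b".
sample = b"a \xf0\x9f\x91\x80 b"
print(replace_sequence_with_space(sample, 3))
```

Note that swapping 4 bytes for 1 shortens the document by 3 bytes, so
any position saved from an earlier pass would need adjusting before the
next call.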
I think I understand - although I'm __very__ unhappy right now - in
another list, Barbara said, "...the value for the CCSID option is the
CCSID of the data to be passed to the XML-SAX handler procedure..."
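This suggests that the lengths under consideration are __character__
lengths, not __byte__ lengths - so the reported position of 379 is
"right" - there are 379 UTF-8 (and UCS-2) characters, including the one
where the error occurred. The 3-byte ellipsis earlier in the file counts
as only 1 character, which would account for the 2-byte gap between 379
and 381.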
So if that IS the case, I'm not sure what to do. We can manually make
the change - we might be only a couple or a few bytes away from the
actual problem - and run it through again. Of course, the more "odd"
characters there are, the farther off we will be.
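If the reported position really is a character count, one conceivable
workaround is to walk the original UTF-8 bytes and translate that count
into a byte position. A minimal Python sketch follows (illustration
only, not RPG), under the assumption that every character before the
error counts as exactly one unit. `char_pos_to_byte_pos` is a
hypothetical helper, and the sample document is synthetic, built only to
mirror the numbers in this post.

```python
def char_pos_to_byte_pos(data: bytes, char_pos: int) -> int:
    """Map a 1-based character position to the 1-based byte position
    of that character's first byte in UTF-8 data."""
    seen = 0
    for index, byte in enumerate(data):
        # Continuation bytes look like 10xxxxxx; any other byte starts a character.
        if (byte & 0xC0) != 0x80:
            seen += 1
            if seen == char_pos:
                return index + 1
    raise ValueError("character position is past the end of the data")


# Synthetic document: 375 ASCII bytes, one 3-byte ellipsis, "a ", the emoji, a tail.
sample = b"x" * 375 + b"\xe2\x80\xa6" + b"a " + b"\xf0\x9f\x91\x80" + b"</x>"
print(char_pos_to_byte_pos(sample, 379))
```

Because the walk counts lead bytes rather than guessing, the result
should stay accurate however many multi-byte characters come before the
error, as long as the stated assumption about the reported position
holds.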
I'm aware of - and would like to use - the SQL XML functionality, which
does NOT seem to have these problems - I've tried a few bits. Problem
is, it'd be a rewrite of this program.
So since there is a strong need for something to help, and this MIGHT be
a curious case, maybe I should put my current solution in place, then
get to the SQL one later (probably translates to never!!!).
**__Or would IBM provide something to give the actual byte position of
the error in the original file?__**
OK, I hope that's made some sense - I'm not looking forward to telling
this to my boss.
Cheers
Vern
This mailing list archive is Copyright 1997-2024 by midrange.com and David Gibbs as a compilation work. Use of the archive is restricted to research of a business or technical nature. Any other uses are prohibited. Full details are available on our policy page.