XML-SAX *XML_EXCEPTION event and reported parsed length question -- RPG400-L

Y'all

We are parsing XML sent up to the IBM i from our field associates. I'm working on how to handle emojis that are part of the UTF-8 content of the XML.

I want to be able to auto-correct the XML in the parsing program as much as I can. That is, put a space where there is some non-convertible content, such as a happy face emoji.

With much good help from the group, I changed the XML-SAX options to use ccsid=ucs2 - this was very helpful and got us past problems with characters that EBCDIC does not have, such as ellipses and em dashes.

Now I have code that will take the reported length when an *XML_EXCEPTION event occurs - and I am working under the assumption that this is the __number of bytes__ to the error in the XML file.

The documentation says that the "...value of the string-length parameter is the length of the document that was parsed up to and including the point where the exception occurred."

There is also the RNX0351 MSGID, which says the following : "...parser detected an error at offset &3..." - and every time I've got the 351 status, that offset is 1 less than the "string-length" value - makes sense, offset is 1 less than position.

OK - so I have a problem - I have XML that has a horizontal ellipsis (the 3 periods thing) and an emoji (a 4-byte string for a pair of eyes). The ellipsis is represented with 3 bytes for the 1 character.

The ellipsis passes fine - it gets converted to UCS-2 from UTF-8 with no problem.

The emoji triggers the exception, and turns on the 351 status.

The "position" parameter is 379 in this particular XML. The bytes at that position are x'6120F09F' - and this is odd, because x'61' and x'20' are valid UTF-8 characters (lower case a and a space).

The actual offending bytes are x'F09F9180', which is the EYES emoji. But that string is 2 bytes farther along in the XML file. These are at position 381 (bytes) of the XML file, and that is what I want to replace with a space.
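To double-check what those bytes are, here is a minimal sketch in Python (used only for illustration, not RPG). It decodes the four bytes quoted above and compares the result with the ellipsis. UCS-2 can only hold code points up to U+FFFF, which is presumably why the ellipsis converts and the emoji does not; the variable names are made up for this example.

```python
# The four bytes found at byte position 381 of the XML file.
raw = bytes.fromhex("F09F9180")

char = raw.decode("utf-8")
code_point = ord(char)

# UCS-2 can only represent code points up to U+FFFF
# (the Basic Multilingual Plane).
fits_in_ucs2 = code_point <= 0xFFFF

print(f"U+{code_point:04X}", "fits in UCS-2:", fits_in_ucs2)

# For comparison: the horizontal ellipsis is 3 bytes in UTF-8
# (x'E280A6') and does fit in UCS-2.
ellipsis = bytes.fromhex("E280A6").decode("utf-8")
ellipsis_fits = ord(ellipsis) <= 0xFFFF

print(f"U+{ord(ellipsis):04X}", "fits in UCS-2:", ellipsis_fits)
```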

I therefore cannot use the "error position" value to replace the 4-byte content _at_ that position with a space, as I'd like to do, then recursively call the procedure that does the XML-SAX operation.

I think I understand - although I'm __very__ unhappy right now - in another list, Barbara said, "...the value for the CCSID option is the CCSID of the data to be passed to the XML-SAX handler procedure..."

This suggests that the lengths under consideration are __character__ lengths, not __byte__ lengths - so the reported position of 379 is "right" - there are 379 UTF-8 (and UCS-2) characters, including the one where the error occurred.
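If that reading is correct, the numbers line up: the one 3-byte ellipsis ahead of the emoji accounts for exactly the 2-byte difference between 379 and 381. A rough sketch of the conversion, in Python rather than RPG and assuming the reported value really is a 1-based character count, could look like this. The function name and the sample document are hypothetical; the sample is built only to reproduce the 379/381 figures, it is not the real XML.

```python
def char_pos_to_byte_pos(data: bytes, char_pos: int) -> int:
    """Map a 1-based character position in UTF-8 data to the 1-based
    byte position where that character starts."""
    text = data.decode("utf-8")
    if not 1 <= char_pos <= len(text):
        raise ValueError("character position out of range")
    # Count the bytes used by every character before the target,
    # then add 1 because positions are 1-based.
    return len(text[:char_pos - 1].encode("utf-8")) + 1


# Made-up document: 378 characters before the emoji, one of which is
# a 3-byte ellipsis, so the emoji is character 379.
sample = (
    "x" * 200 + "\u2026" + "x" * 177 + "\U0001F440" + "</doc>"
).encode("utf-8")

byte_pos = char_pos_to_byte_pos(sample, 379)
print(byte_pos)  # 381

# The four bytes starting there are the emoji itself.
print(sample[byte_pos - 1:byte_pos + 3].hex().upper())  # F09F9180
```

Each additional multi-byte character ahead of the error would widen the gap further, so the conversion has to look at the whole document up to that point.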

So if that IS the case, I'm not sure what to do. We can manually make the change - we might be only a couple or a few bytes away from the actual problem - and run it through again. Of course, the more "odd" characters there are, the farther off we will be.
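One way around the drifting position would be to not depend on it at all, and blank out every 4-byte UTF-8 sequence before the parse. This is only a sketch under that assumption, written in Python for illustration; the function name is hypothetical, and it assumes the input is otherwise valid UTF-8. Note that it shortens the data, since 4 bytes become 1.

```python
import re

# A 4-byte UTF-8 sequence: a lead byte in x'F0'-x'F4' followed by
# three continuation bytes in x'80'-x'BF'. These encode code points
# above U+FFFF, which UCS-2 cannot hold.
FOUR_BYTE_SEQ = re.compile(rb"[\xF0-\xF4][\x80-\xBF]{3}")


def blank_out_non_bmp(data: bytes) -> bytes:
    """Replace each 4-byte UTF-8 sequence with a single space."""
    return FOUR_BYTE_SEQ.sub(b" ", data)


before = "<note>look \U0001F440 here\u2026</note>".encode("utf-8")
after = blank_out_non_bmp(before)

# The emoji is gone; the 3-byte ellipsis is left alone.
print(after.decode("utf-8"))
```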

I'm aware of - and would like to use - the SQL XML functionality, which does NOT seem to have these problems - I've tried a few bits. Problem is, it'd be a rewrite of this program.

So since there is a strong need for something to help, and this MIGHT be a curious case, maybe I should put my current solution in place, then get to the SQL one later (probably translates to never!!!).

**__Or would IBM provide something to give the actual byte position of the error in the original file?__**

OK, I hope that's made some sense - I'm not looking forward to telling this to my boss.

Cheers
Vern