RE: expat XML parsing problem, CPF9897: Parse error at line 1: unknown encoding -- MIDRANGE-L

Hi Scott,

First of all, thanks to the people who created Expat.
I will change the program to use UTF-8 (CCSID 1208) and run it again.
If I can't get the program to run properly with UTF-8, I will try writing my own parsing code.
I was lost, and you have lit my path on this job. Thank you, Scott. You are a good teacher.

Date: Thu, 12 Jan 2012 13:33:00 -0600
From: midrange-l@xxxxxxxxxxxxxxxx
To: midrange-l@xxxxxxxxxxxx
Subject: Re: expat XML parsing problem, CPF9897: Parse error at line 1: unknown encoding

Hello Cevdet,

Expat isn't "Scott Klement's tool". While it's true that I wrote some
articles about how to use it from RPG, I had no part in its development.

Out of the box, Expat supports 4 different input encodings. They are:

* US-ASCII (deprecated)
* ISO-8859-1
* UTF-8
* UTF-16

Note two things:
1) ISO-8859-9 is NOT one of them.
2) It does not understand _any_ form of EBCDIC at all.

Expat also lets you register your own encoding handler via the
XML_SetUnknownEncodingHandler() API. This is Expat's official method of
adding support for other encodings besides the 4 listed above.

However, even if you use XML_SetUnknownEncodingHandler, it cannot handle
EBCDIC encodings, because Expat still expects the basic characters
needed to parse the <?xml?> declaration to match what they are in
ASCII/Unicode. So it never supports EBCDIC, ever.
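
The point about the declaration can be seen outside of IBM i as well. Below is a minimal sketch in Python (not RPG), whose standard xml.parsers.expat module wraps Expat. The sample document, the helper name parses_ok, and the choice of cp037 as the EBCDIC code page are assumptions made purely for illustration:

```python
import xml.parsers.expat

# Hypothetical sample document, used only for this illustration.
DOC = '<?xml version="1.0"?><rate>1.85</rate>'

def parses_ok(data: bytes) -> bool:
    """Return True if Expat accepts the bytes as a well-formed document."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)
        return True
    except xml.parsers.expat.ExpatError:
        return False

ascii_bytes = DOC.encode("ascii")
ebcdic_bytes = DOC.encode("cp037")  # cp037 is one common EBCDIC code page

# The first five bytes ("<?xml") already differ between the two encodings,
# so Expat never recognizes the start of the declaration in the EBCDIC copy.
print(ascii_bytes[:5].hex(), ebcdic_bytes[:5].hex())
print(parses_ok(ascii_bytes), parses_ok(ebcdic_bytes))
```

The same text is accepted in ASCII and rejected in EBCDIC, before any encoding handler could get involved.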

My solution to this was very simple: If I'm not expecting the data to
have one of the four Expat-friendly encodings, I let IBM i translate my
input file into UTF-8. Since Expat supports UTF-8, this solves my
problem completely.
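
The same idea (translate to UTF-8 first, then tell Expat to expect UTF-8) can be sketched in Python, whose xml.parsers.expat module wraps Expat. On IBM i the translation is done by the IFS when the file is opened; here a plain decode/encode stands in for it. The helper name parse_as_utf8, the sample document, and the assumption that the source bytes are ISO-8859-9 are all illustrative:

```python
import xml.parsers.expat

def parse_as_utf8(raw: bytes, source_encoding: str) -> list:
    """Transcode raw bytes to UTF-8, parse them with Expat, and
    return the character data chunks that Expat reports."""
    utf8_bytes = raw.decode(source_encoding).encode("utf-8")
    chunks = []
    # An encoding passed at parser creation overrides the one named
    # in the document's own XML declaration.
    parser = xml.parsers.expat.ParserCreate("UTF-8")
    parser.CharacterDataHandler = chunks.append
    parser.Parse(utf8_bytes, True)
    return chunks

# Hypothetical input: a document stored as ISO-8859-9 (Turkish).
raw = ('<?xml version="1.0" encoding="ISO-8859-9"?>'
       '<name>T\u00fcrk Liras\u0131</name>').encode("iso8859_9")

# Expat may deliver text in several pieces, so join them.
print("".join(parse_as_utf8(raw, "iso8859_9")))
```

Because the parser is created with an explicit encoding, the ISO-8859-9 label left in the declaration does no harm once the bytes themselves are UTF-8.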

But you're not doing that! Indeed, your code is telling IBM i to
translate the data to EBCDIC, something that can't ever work with Expat.
Let's take a look at your code:

 /free
     fd = open('/tmp/tcmb_kur.xml': O_RDONLY+O_TEXTDATA);
     if (fd < 0);
        EscErrno(errno);
     endif;

Here you are specifying O_TEXTDATA without specifying which code page or
CCSID to translate the data to. The result? It's going to translate
the data, using code pages (not CCSIDs), to the job's default CCSID.
And the job's CCSID will, undoubtedly, be EBCDIC.

Instead, please consider doing this:

     fd = open( '/tmp/tcmb_kur.xml'
              : O_RDONLY + O_TEXTDATA + O_CCSID
              : 0
              : 1208 );
     if (fd < 0);
        EscErrno(errno);
     endif;

O_CCSID means you want to use CCSIDs (instead of the default, which is
code pages). This is important, since Unicode is not supported with code
pages.

I've also told the open() API that my program's data is in CCSID 1208.
1208 is UTF-8, so the IFS APIs will pass data to my program in UTF-8
(not EBCDIC). This means it'll be hard to view the data in the
debugger, which is unfortunate, but it'll be understood by Expat.
(Provided that you tell Expat to expect UTF-8.)

Strangely, though, you are telling Expat to expect the data in
ISO-8859-1. Since there are characters in ISO-8859-9 that don't exist
in ISO-8859-1, that's not a good choice. (And it's why I'd use UTF-8: it
supports nearly every character known to man.)
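
To make the difference between the two character sets concrete, here is a small Python sketch (an illustration only, not part of the RPG program). It lists the byte values where ISO-8859-9 and ISO-8859-1 disagree and shows what happens to Turkish text when it is read with the wrong one. The sample word is an arbitrary choice:

```python
# Byte values that decode to different characters in the two encodings.
differing = [b for b in range(256)
             if bytes([b]).decode("iso8859_9") != bytes([b]).decode("iso8859_1")]
print([hex(b) for b in differing])

# Arbitrary Turkish sample text, stored as ISO-8859-9 bytes.
original = "\u0131\u015f\u0131\u011f\u0131"
turkish = original.encode("iso8859_9")

# Reading the bytes as ISO-8859-1 raises no error; it silently yields
# the wrong letters, which is why the mistake is easy to miss.
misread = turkish.decode("iso8859_1")
print(misread)

# Decoding with the right character set and re-encoding as UTF-8 is lossless.
roundtrip = turkish.decode("iso8859_9").encode("utf-8").decode("utf-8")
print(roundtrip == original)
```

The two sets differ at only six positions, but those positions hold exactly the Turkish-specific letters.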

Here's what you have:

     p = XML_ParserCreate(XML_ENC_ISO8859_1);
     if (p = *NULL);
        callp close(fd);
        die('Couldn''t allocate memory for parser');
     endif;

Here's what I recommend:

     p = XML_ParserCreate(XML_ENC_UTF8);
     if (p = *NULL);
        callp close(fd);
        die('Couldn''t allocate memory for parser');
     endif;

So now I'm telling Expat to expect the data in UTF-8 format, and since
IBM i has been asked to translate the data to 1208, everyone should be
happy.

(Unless the data in your file doesn't match the CCSID it has been marked
with -- but, we'll cross that bridge when/if we come to it.)
--
This is the Midrange Systems Technical Discussion (MIDRANGE-L) mailing list
To post a message email: MIDRANGE-L@xxxxxxxxxxxx
To subscribe, unsubscribe, or change list options,
visit: http://lists.midrange.com/mailman/listinfo/midrange-l
or email: MIDRANGE-L-request@xxxxxxxxxxxx
Before posting, please take a moment to review the archives
at http://archive.midrange.com/midrange-l.