Re: Parse error ... not well formed token -- MIDRANGE-L

Scott,

Thanks for the details - it really helps in understanding all the factors.

The .dtd looks like this (I wrote a separate pgm, not using the parser, to read
it & generate the many tables in this xml):

<?xml version='1.0' encoding='UTF-8' ?>

<!ELEMENT Data (Record+)>
<!ELEMENT Record (Account , Claim , Network? ,
...

I did add the ... encoding="utf-8" to the 1st record of the file - it still halted on the same
character in the same record.
I like adding the 1 statement in the pgm to cover the encoding. These files
are all generated from a no longer supported package (which is why they
are exporting the data to us). I will just need to patch or write a scan &
replace to pre-process the files.
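
A scan & replace like that can be done byte by byte, since XML 1.0 only allows three characters below hex 20 (tab, LF, CR), and those low bytes never occur inside a multi-byte UTF-8 sequence or in CCSID 1252 text. Here is a rough sketch in Python rather than RPG, just to show the idea. The name scrub, the chunk size, the choice of a space as the replacement, and the sample record are all made up for illustration. It would not be safe for UTF-16 files.

```python
import io

# Bytes below 0x20 that XML 1.0 does not allow: everything except tab, LF, CR.
ILLEGAL = bytes(b for b in range(0x20) if b not in (0x09, 0x0A, 0x0D))
# Translation table that maps each illegal byte to a space (0x20).
TABLE = bytes.maketrans(ILLEGAL, b" " * len(ILLEGAL))


def scrub(src, dst, chunk_size=1 << 20):
    """Copy src to dst in chunks, replacing illegal control bytes.

    Returns the number of bytes that were replaced.
    """
    replaced = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        # Deleting the illegal bytes and comparing lengths gives the count.
        replaced += len(chunk) - len(chunk.translate(None, ILLEGAL))
        dst.write(chunk.translate(TABLE))
    return replaced


# In-memory demonstration with a made-up record containing a hex 1D.
# Real use would pass files opened with open(path, "rb") and open(path, "wb").
source = io.BytesIO(b"<Record>ACME\x1dCorp</Record>\n")
target = io.BytesIO()
count = scrub(source, target)
```

Because it reads a chunk at a time, the same loop should cope with files in the 2 - 3 gig range without loading them into memory.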

I am still confused by sometimes seeing "UTF-8" and in your code "UTF8"..

XML_ParserCreate(XML_ENC_UTF8);

In the infocenter (under encoding) it seemed to treat these as different
definitions. Any differences?

As always - thanks,

Jim Franz

----- Original Message -----
From: "Scott Klement" <midrange-l@xxxxxxxxxxxxxxxx>
To: "Midrange Systems Technical Discussion" <midrange-l@xxxxxxxxxxxx>
Sent: Friday, October 12, 2012 6:59 PM
Subject: Re: Parse error ... not well formed token

Hi Jim,

The DTD file has an <?xml?> line at the top of it?? That's strange,
because a DTD file is not in XML format. Though, DTD files are not
normally used by XML these days, anyway. I wonder if it's really an XSD
where someone used the wrong extension? XSD files are XML documents,
and it would be quite proper to have <?xml?> in that case.

But, at any rate, Expat isn't reading that DTD file (unless you've added
code to make that happen) so the DTD is a moot point.

Your open() call is reading the file in binary mode, and so in the
absence of a <?xml encoding="utf-8"?> tag, I'm pretty sure that Expat
will assume the file to be in ISO-8859-1.

If you want to force it to consider the file to be UTF-8, you can do
that on your call to XML_ParserCreate(), which might be easier than
modifying all of the XML files you receive :-)

myParser = XML_ParserCreate(XML_ENC_UTF8);

This will tell Expat to assume the file is in UTF-8, no matter what the
<?xml encoding="xxx"?> says.
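
The override is easy to try off the IBM i with Python's xml.parsers.expat module, which wraps the same Expat library. This is only a sketch: parse_text and the sample document are made up for illustration, and passing "UTF-8" to ParserCreate() is assumed to play the same role as XML_ENC_UTF8 in the RPG call above.

```python
import xml.parsers.expat


def parse_text(data, encoding=None):
    """Parse bytes with Expat and return the concatenated character data."""
    parts = []
    parser = xml.parsers.expat.ParserCreate(encoding)
    # Character data may arrive in several pieces, so collect and join them.
    parser.CharacterDataHandler = parts.append
    parser.Parse(data, True)
    return "".join(parts)


# The prolog claims ISO-8859-1, but the payload bytes C3 A9 are UTF-8 for "é".
doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><a>\xc3\xa9</a>'

declared = parse_text(doc)          # trusts the prolog: two Latin-1 characters
forced = parse_text(doc, "UTF-8")   # ignores the prolog: one character
```

With the override, the two bytes come back as a single "é". Without it, each byte is treated as its own Latin-1 character.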

Maybe that will help.. the other possibility, of course, is that
there's an actual character problem at the position that you mentioned
earlier... but I assume you already looked into that?

-SK

On 10/12/2012 1:48 PM, J Franz wrote:

Scott,

The file in the ifs is currently CCSID 1252. It was copied from a non-i (win or unix)
to a win ftp server in zip format, unzipped, then copied to i thru a mapped drive.

The xml itself has no designation in the 1st record: <?xml version="1.0"?> ,
but a separate .dtd file does have this in 1st record:
<?xml version='1.0' encoding='UTF-8' ?>

The open statement in rpgle is:
fd = open( %trim(@filename) : O_RDONLY + O_LARGEFILE );

question: so do I need to change 1st line of xml file to
<?xml version='1.0' encoding='UTF-8' ?>

(btw I'm not auth to see the netserver config if set to xlate..sent the request
up the chain)

Jim

________________________________
From: Scott Klement <midrange-l@xxxxxxxxxxxxxxxx>
To: Midrange Systems Technical Discussion <midrange-l@xxxxxxxxxxxx>
Sent: Fri, October 12, 2012 2:25:41 PM
Subject: Re: Parse error ... not well formed token

Hi Jim,

How are you using Expat? Are you giving it raw binary data, and
letting it figure out the encoding? Or are you translating it when
reading the IFS, and overriding the Expat parser to a particular encoding?

You reference CCSID 1252... I'm trying to figure out how that fits into
the equation.

Note that CCSID 1252 is definitely not the same as UTF-8. The most
basic/commonplace characters (the plain ASCII range) are encoded the
same way in CCSID 1252 as they are in UTF-8, but aside from that, they
are not the same. UTF-8 (which is CCSID 1208) supports more than a
million characters in one encoding... Windows Latin-1 (CCSID 1252)
supports about 200. So you can imagine that there are many things in
UTF-8 that don't exist in 1252.
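
The difference is easy to see at the byte level. A small sketch in Python, assuming its "cp1252" codec is a fair stand-in for CCSID 1252 (the sample strings are made up):

```python
# Plain ASCII text encodes to the same bytes in both encodings.
plain = "Account"
same = plain.encode("cp1252") == plain.encode("utf-8")

# An accented letter does not: one byte in 1252, two bytes in UTF-8.
accented = "é"
as_1252 = accented.encode("cp1252")   # the single byte E9
as_utf8 = accented.encode("utf-8")    # the two bytes C3 A9

# Reading the 1252 byte as UTF-8 fails, because a lone E9 is not valid UTF-8.
try:
    as_1252.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
```

So a file that is really 1252 but gets treated as UTF-8 (or the other way around) will look fine until the first non-ASCII character shows up.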

But, if you're just handing binary data to Expat, it should
automatically detect the encoding from the <?xml encoding=?> line at
the top. It doesn't use the CCSID on the file unless you write code to
make it do that.

Just make sure the data is actually encoded by the same standard as the
<?xml?> says it is... and that you aren't translating during file
transfer, or something like that.

-SK

On 10/12/2012 10:35 AM, J Franz wrote:

Using Expat parser, and UTF8 file - it failed at line 2,097,268 (so I thought
ccsid 1252 should be good?)
What looks like a blank space is a hex 1D (WRKLNK opt 5-DSPFIL)

Any way to keep parser from throwing up (the ignore opt on CPF9897 ended the
pgm)? Or a method of verifying all characters are valid before the parsing?
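
For comparison, here is roughly what trapping the error looks like with Python's xml.parsers.expat module, which wraps the same Expat library. It is not RPG and does not touch the CPF9897 handling. The helper name check and the sample data are made up, with a hex 1D planted on line 2. A hex 1D is not a legal character in XML 1.0, so Expat is expected to reject it whatever the encoding.

```python
import xml.parsers.expat


def check(data):
    """Return None if data parses, else (message, line, column) of the first error."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError as err:
        # err.code is Expat's numeric error code; ErrorString gives its text.
        return (xml.parsers.expat.ErrorString(err.code), err.lineno, err.offset)
    return None


good = check(b"<Data><Record>ok</Record></Data>")
bad = check(b"<Data>\n<Record>a\x1db</Record>\n</Data>")
```

The parser still stops at the first bad character, so this reports where the problem is rather than skipping over it.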

2nd issue is I have several files "too big" for DSPFIL to open - any other
options? Files are in the 2 - 3 gig range and it is not an option for the
exporting system to break them down. I could write a pgm to split them, but
would prefer not to. v6r1

Jim Franz

--
This is the Midrange Systems Technical Discussion (MIDRANGE-L) mailing list
To post a message email: MIDRANGE-L@xxxxxxxxxxxx
To subscribe, unsubscribe, or change list options,
visit: http://lists.midrange.com/mailman/listinfo/midrange-l
or email: MIDRANGE-L-request@xxxxxxxxxxxx
Before posting, please take a moment to review the archives
at http://archive.midrange.com/midrange-l.