Re: CPYFRMIMPF and Unicode - UTF-16 in particular -- MIDRANGE-L

Comments inline

On 3/4/2013 12:57 PM, CRPence wrote:

On 04 Mar 2013 09:07, Vernon Hamberg wrote:

I believe I've an answer that ends up being very flexible.
CPYFRMSTMF. <<SNIP>>
CPYFRMSTMF ... DBFCCSID(*FILE) ... CVTDTA(*AUTO)<<SNIP>>
This was preceded by a CRTPF QTEMP/FLATFILE RCDLEN(3000)

That should convert to EBCDIC using the Job [default] CCSID. The
program-described file effectively has no CCSID, but *AUTO still must
effect conversion from ASCII encoding to EBCDIC encoding. That would
sure limit the data that can be processed; i.e. sure defeats the purpose
of having used UTF-16 :-)

Agreed, but this data is restricted to North American use for now, so this seems safe enough. We are also trying to get away from having some user decide how to save an email attachment, where they might be told to save as ANSI. Users make mistakes!

With UTF-16, there is an extra row interleaved - because there is a
Unicode CRLF, and the conversion sees the CR and the LF as separate.
No problem - this is easy to clean up!

I do not see that issue on v5r3; my files have just *LF. Seems like
a defect. Or perhaps I do not understand what is being described?

When I specify ENDLINFMT(*ALL), CPYFRMSTMF treats any eligible end-of-line marker as the end of a record. That marker is removed.

Now in UTF-16, in my case, I have a CRLF, which looks like this in little-endian: x0D000A00. The command takes the first x0D as end-of-record and pads the record with blanks. The next record starts with the x00 that followed it. Then the x0A triggers another end-of-record, so we get a record that starts with x00 and is padded with blanks.

If you have a file saved on a Unix box or maybe a Macintosh, it'll have only one of these, either x0D or x0A.
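To make the interleaved-row effect concrete, here is a minimal Python sketch. It is not what CPYFRMSTMF does internally; it only assumes a reader that looks at single bytes and treats CRLF, LFCR, CR, or LF as an end-of-record marker. The sample text and the `split_bytewise` helper are made up for illustration.

```python
import re

# Assumption: the reader works on single bytes and treats CRLF, LFCR, CR,
# or LF as an end-of-record marker, which is then dropped.
EOR = re.compile(rb"\r\n|\n\r|\r|\n")

def split_bytewise(raw):
    """Split a byte stream into records at every byte-level EOR marker."""
    return EOR.split(raw)

# Two lines of text, each ended by CRLF, encoded as UTF-16 little-endian.
sample = "AB\r\nCD\r\n".encode("utf-16-le")
records = split_bytewise(sample)

for rec in records:
    print(rec.hex(" "))
```

In UTF-16LE the CR and LF bytes are never adjacent (an x00 sits between them), so each one ends a record by itself. The data records come out alternating with records that hold only a single x00, and every data record after the first starts with x00.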

<<SNIP>> And the nulls (UTF-16 only) can be cleaned up with an SQL
REPLACE function. <<SNIP>>

Why would a null character appear in /text/ data? The only expected
control characters in a text file are EOR; e.g. *CR, *CRLF, *LF, *LFCR.

CPYFRMSTMF appears to do a byte-by-byte conversion, as it seemed CPYFRMIMPF did as well, just not all the time. So the typical UTF-16 representation for our situation is ANSI characters alternating with x00's. We also have tab characters, which end up as x05 in the PF.
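A small Python sketch of that byte-by-byte effect, under two assumptions that are mine and not from the thread: the job CCSID is 37 (Python's `cp037` codec), and the stream file is read as if it were CCSID 1252 (`cp1252`). It shows the nulls alternating with the data and the tab landing on x05, then removes the nulls much as an SQL REPLACE would.

```python
# Assumptions: stream file tagged 1252, job CCSID 37. Both are illustrative.
raw = "A\tB".encode("utf-16-le")            # 41 00 09 00 42 00

# Byte-by-byte: every byte is treated as one 1252 character, then mapped
# to its EBCDIC code point.
ebcdic = raw.decode("cp1252").encode("cp037")
print(ebcdic.hex(" "))                      # nulls alternate with the data bytes

# Cleanup analogous to replacing x00 with an empty string.
cleaned = ebcdic.replace(b"\x00", b"")
print(cleaned.hex(" "))
print(repr(cleaned.decode("cp037")))
```

This cleanup is only safe while every character fits in one byte of 1252. Any character whose UTF-16 high byte is not x00 would come through as two unrelated characters.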

Any cautions are much appreciated

The ability to have embedded CRLF in delimited column data would be
lost, because the stream is split into database records for each
apparent EOR, even if the control characters were not meant to be seen
as EOR. Obviously having to choose a fixed record length can be an
issue, since there is no such limit for the stream data.
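As an illustration of the embedded-CRLF caution, here is a short Python sketch with made-up delimited data. It contrasts splitting at every apparent EOR, which is roughly what a record-oriented copy does, with a parser that honors quoted fields. Python's `csv` module stands in for the quote-aware side; it is not a model of any IBM i command.

```python
import csv
import io

# Hypothetical delimited data: the second column of row 1 holds an embedded CRLF.
data = 'id,note\r\n1,"line one\r\nline two"\r\n'

# Splitting at every CRLF breaks the quoted field across two records.
naive = data.split("\r\n")
print(naive)

# A quote-aware parser keeps the embedded CRLF inside the field.
parsed = list(csv.reader(io.StringIO(data, newline="")))
print(parsed)
```

The naive split yields one more data record than the file logically contains, and neither half of the broken row is a valid delimited record by itself.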

Probably not an issue here - I've thought of this, but this is all textual data and would have no control characters in it, other than the tab.

Still, this does look pretty cool - no transform needed, similar
effect to how we were using CPYFRMIMPF for ANSI-encoded stream files.

So does that mean that the noted CPYFRMSTMF is functional for both
UTF-16BE and UTF-16LE, such that effectively it does the transform of
the data to enable the CCSID translation from 1200 to the defaulted EBCDIC
CCSID?

If it can handle the transform, then it would seem odd that the
Byte-Order-Mark (BOM) would not be dropped per its no longer having
meaning in a database file.mbr.

It doesn't do a 1200-to-EBCDIC transform. Remember that these stream files are flagged with CCSID 1252, as coming from Windows. They would likely be flagged as 819 if FTP were the transfer mechanism, but these are coming in as email attachments.
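For comparison, a Python sketch of the difference between reading the bytes as 1252 and actually decoding them as UTF-16. Assumptions for illustration only: the file starts with a little-endian BOM, and CCSID 37 (`cp037`) is the EBCDIC target. Python's `utf-16` codec uses the BOM to pick the byte order and then drops it.

```python
import codecs

# A UTF-16LE stream with a leading BOM: ff fe 48 00 69 00
raw = codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le")

# Read as 1252: the BOM survives as two ordinary characters, and each
# x00 high byte becomes a null character.
bytewise = raw.decode("cp1252")
print(repr(bytewise))

# Decoded as UTF-16: the BOM selects the byte order and is removed.
decoded = raw.decode("utf-16")
print(repr(decoded))

# Only after a real decode does the EBCDIC result hold just the text.
print(decoded.encode("cp037").hex(" "))
```

With the file flagged as 1252, nothing tells the copy that pairs of bytes belong together, which would be consistent with the BOM and the nulls both surviving into the database file.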