RE: Creating a unicode file in the IFS -- RPG400-L

Wow, excellent sleuthing, Scott! I am, as usual, in awe!

The final verdict is still out since I won't be able to access the US-based
system until tomorrow. But confidence is high due to your test and some of
my own.

After reading your message, I tried the following. I have a directory of
binary objects that we'll call 'binaries' for now.

(from QSHELL)

set -x; tar cf from-qsh.tar binaries; tar cf rewrap.tar from-qsh.tar;
mkdir tmp; cd tmp; tar xf ../rewrap.tar; tar xf ./from-qsh.tar

+ tar cf from-qsh.tar binaries # Create a tarball
+ tar cf rewrap.tar from-qsh.tar # Wrap within another tarball
+ mkdir tmp # Create a work area
+ cd tmp
+ tar xf ../rewrap.tar # Untar the "outer" tarball
+ tar xf ./from-qsh.tar # Try to untar the extracted inner tarball
tar: 001-2320 Searching for valid archive header.

tar: 001-2301 End of archive volume 1 reached.

tar: An archive volume change is required.

tar: Ready for archive volume 1.

tar: Enter an archive path name or a period (".") to exit:

Failure! from-qsh.tar was not maintained intact - even on this same system,
even without transfer/copy!

--------

With QP2TERM, the same process worked flawlessly.

Note: my favored approach would have involved the diff utility but, contrary
to unix standards, neither version of diff (QSH/QP2TERM) does proper binary
comparison.
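Since diff is no help with binaries, a byte-for-byte check can be sketched with cmp and cksum instead. This is only an illustration under assumptions: it assumes a POSIX-style shell where cmp, cksum, tar and mktemp are available (I have not verified that for QSH or QP2TERM), and the scratch directory and payload.bin are made-up names. It repeats the rewrap test above and compares the inner tarball before and after.

```shell
#!/bin/sh
# Sketch: repeat the "tarball inside a tarball" test and verify the inner
# tarball byte for byte. All names here are hypothetical.
set -e
work=$(mktemp -d)          # scratch area, removed at the end
cd "$work"

mkdir binaries
# A small file with non-text bytes (octal escapes: 0x00 0x88 0xff 0x0a 0x25)
printf '\000\210\377\012\045' > binaries/payload.bin

tar cf from-qsh.tar binaries          # create a tarball
tar cf rewrap.tar from-qsh.tar        # wrap it within another tarball
mkdir tmp
(cd tmp && tar xf ../rewrap.tar)      # untar the "outer" tarball

# cmp -s exits 0 only if the two files are identical, byte for byte.
if cmp -s from-qsh.tar tmp/from-qsh.tar; then
    result=identical
else
    result=different
fi
echo "inner tarball: $result"

# cksum gives a second opinion: same CRC and size means same content
# (barring a CRC collision).
sum_before=$(cksum < from-qsh.tar)
sum_after=$(cksum < tmp/from-qsh.tar)
echo "before: $sum_before"
echo "after:  $sum_after"

cd /
rm -rf "$work"
```

If the extracted copy comes back "different", the archive was altered somewhere between creation and extraction, which is the failure described above.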

Now I shall set about installing 7z on a system where I lack admin rights.
Should be an interesting venture.

Dennis Lovelady
http://www.linkedin.com/in/dennislovelady
--
"Always do right. This will gratify some people and astonish the rest."
-- Mark Twain

Whenever you're discussing CCSIDs, it's always important to distinguish
between the "label" and the "fact". By "label", I mean the number that
is assigned to an object in its object description. By "fact", I mean
the actual table of ASCII, EBCDIC, etc. that the computer used to
generate the original data when it was created. In a perfect world,
these would both be the same CCSID, but frequently (especially in data
interchange) they are not.

An analogy: Suppose you have made large batches of both strawberry and
raspberry jam. (Mmmmm.. homemade jam). You put your jam into jars, and
stick a label on each jar that says either "strawberry" or "raspberry".
Otherwise, you might not know (without tasting, anyway) which jam is
in which bottle.

In that scenario, you have a "label" (the sticker on the outside) and
the "fact" (the actual fruit in the jar.) You hope they are the same --
but there's always a chance for a mistake. A mislabeled jar would
result in the great tragedy of eating strawberry instead of raspberry.
(Okay, I guess there have been bigger tragedies in history -- but you
get the idea.)

The same is true with your data exchange. It'd be nice if the CCSID in
the object description matched the actual encoding used for the data in
the file. But it often gets mixed up -- so I suggest not paying too much
attention to that file descriptor. Don't place too much trust in the
"label". Instead, evaluate the data itself.

Another point: Some files aren't text, and therefore the CCSID on the
"label" is irrelevant. There is no "fact" for you to place a label on,
because the data inside isn't related to ASCII or EBCDIC, it's just raw
binary data. (Having a hard time fitting this into the jam analogy --
maybe a jar full of water. It doesn't matter if it says "strawberry" or
"raspberry", because it's not jam... not a great analogy, I guess.)

Then you have to deal with computer programs simply NOT KNOWING what
the correct encoding is. A person might be able to guess at an encoding
by looking at the contents of an object -- but there's no way a computer
can do that. They lack the intelligence. They make assumptions
(defaults) about what things are -- but they expect YOU to solve the
problem.

Okay, okay... enough philosophy, let's look at your scenario.

STEP 1: You use 'touch' to create a file with CCSID 37.

Since 'touch' creates an empty file, all you've done is set the label.
(i.e. put a "strawberry" label on an empty jam jar.) The system is
relying on you to use appropriate methods to actually put EBCDIC data
into the file. (Or to use software that ignores the CCSID entirely if
you're working with binary data.) In other words: If you label the jar
"strawberry", it's still up to you to make sure you put strawberry jam,
not raspberry, in the jar.

But I think you've already tested and discovered that you, indeed, had
the right data in the file... so on to the next step.

STEP 2: Use 'tar' to create a tarball.

Okay, now you need to know what the 'tar' utility is going to do with
regard to CCSID translation. Bear in mind that 'tar' was invented for
Unix, and Unix file systems don't typically store a "label" for the
CCSID of an object. So the TAR program isn't going to keep track of
your CCSID while it's inside the tarball.

I don't know whether you're using QShell or PASE to create the tarball.
I'm pretty sure we can rely on PASE to make a binary-perfect copy of
the file. I'm not so confident about QShell, since QShell is native to
IBM i, and IBM i users often create stuff in EBCDIC. Sending an EBCDIC
file in a Unix Tape Archive (tar) makes little sense -- so QShell may
be trying to translate it to ASCII. Obviously, if your data is binary
data (such as a Java JAR file) this would corrupt it.

Let's look at the docs for the QShell 'tar' program:
http://publib.boulder.ibm.com/infocenter/iseries/v5r4/topic/rzahz/tar.htm

At the bottom of the page, it says (quote):

QIBM_CCSID
The value of the environment variable is the CCSID used
to create files extracted from the archive. There must
be a valid translation from CCSID 819 to the specified
CCSID.

This description is only talking about extraction -- not creation. It
doesn't say anything about creating the data in the tarball with regard
to CCSID. Hmmm.. let's do a test:

> echo "hello" > mytest.txt
$
> od -x mytest.txt
0000000 8885 9393 9625
0000006
> tar cf hello.tar mytest.txt
$

The tarball itself is a binary document. The CCSID on that document is
irrelevant, since it does not contain pure text. The CCSID of
'mytest.txt' is 37, I verified that with WRKLNK. The hex dump of the
file (as shown above) shows that the data is indeed EBCDIC... hex 88 is
the EBCDIC code for the "h" in the first position.

Now I transfer that tarball to my FreeBSD PC, and I'm careful to use
binary mode. (Though, if I wasn't, it'd corrupt the tarball, and I
wouldn't be able to extract anything -- so using ASCII mode wouldn't
translate the contents of the file INSIDE the tarball.)

On my FreeBSD box:

$ tar xf hello.tar

$ ls -l mytest.txt
-rw-r--r-- 1 klemscot wheel 6 Aug 25 16:06 mytest.txt

$ od -x mytest.txt
0000000 6568 6c6c 0a6f
0000006

$ cat mytest.txt
hello

As you can see... I had no trouble extracting the data, but the data is
now in ASCII, not EBCDIC! It's now viewable in FreeBSD, an ASCII
system, and the hex dump confirms that it is indeed ASCII.

The file no longer has a CCSID "label", since FreeBSD has no notion of
CCSIDs. (Nor does the FTP protocol, itself, by the way. It has no way
of communicating a CCSID from the server to my PC.) The CCSID label is
gone and irrelevant now -- but the data is OBVIOUSLY in ASCII.

Let's try it again, back in QShell, this time setting QIBM_CCSID when
building hello.tar. The Info Center says that, on extraction, tar
translates from 819 to QIBM_CCSID. When creating, does it do the
opposite? If so, setting QIBM_CCSID to 819 should stop it from
translating. In QShell:

$
> rm hello.tar
$
> QIBM_CCSID=819 tar cf hello.tar mytest.txt
$

Once again, FTP it to the FreeBSD box in binary mode, and do this:

$ tar xf hello.tar
$ od -x mytest.txt
0000000 6568 6c6c 0a6f
0000006

It's *still* ASCII! QShell's tar utility seems to have decided to
translate my EBCDIC file to ASCII, and it doesn't seem to care about
QIBM_CCSID. Lovely. Let's try PASE... from QP2TERM:

$
> rm hello.tar; tar cf hello.tar mytest.txt
$

Once again, FTP the file to FreeBSD in binary mode, then do:

$ tar xf hello.tar; od -x mytest.txt
0000000 8588 9393 2596
0000006

W00t! The data is still EBCDIC as I intended. So there you go,
QShell's tar utility is translating the data to ASCII when it creates
the tarball.
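A side note on reading the dumps: the QShell dump earlier shows 8885 9393 9625 while the FreeBSD dump shows 8588 9393 2596, yet both describe the same six bytes. od -x prints 16-bit words in the byte order of the machine it runs on, so the big-endian IBM i and the little-endian PC pair the bytes differently. The sketch below, assuming a POSIX od and printf, dumps one byte at a time with -t x1, which reads the same on either machine. The variable names are just for illustration.

```shell
#!/bin/sh
# Sketch: byte-wise hex dumps, independent of host byte order.
# The octal escapes spell "hello" plus the line terminator as they appear
# in the EBCDIC dump above (88 85 93 93 96 25); the second line is the
# ASCII equivalent. tr squeezes od's whitespace so the output is stable.

ebcdic=$(printf '\210\205\223\223\226\045' | od -An -tx1 | tr -s ' \n' ' ')
ascii=$(printf 'hello\n' | od -An -tx1 | tr -s ' \n' ' ')

echo "EBCDIC:$ebcdic"
echo "ASCII: $ascii"
```

Read this way, the first byte is 88 ("h" in EBCDIC) or 68 ("h" in ASCII) no matter which box produced the dump.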

STEP 3: Transfer the file to intermediate places using FTP

Since you used binary mode, I'm not worried about this part of the
transfer. If you screw up the binary contents of a tarball, it won't
extract. So this part is a non-issue.

STEP 4: Receive the tarball on the destination system via FTP

You get it, and you note that the CCSID of the tarball is now 37. Since
a tarball isn't text, it shouldn't make any difference what its CCSID
is... Again, the CCSID you see is just a label on the outside of a
container. It doesn't matter if you stick a "strawberry jam" label on a
rock, since nobody is going to spread the rock on toast and try to eat
it.

Likewise, the CCSID on the tarball doesn't matter because nobody is
going to try to read it as a text file. I don't know why this one is
getting 37... Remember, the PC it's coming from has no notion of
CCSIDs, and the FTP protocol has no way of communicating CCSIDs. So
presumably, 37 is just the default that's set for new files in your FTP
server's configuration. But again, it's irrelevant.

STEP 5: Extract the data using QShell's tar command.

We already know from the Info Center, as well as experiments, that
QShell's tar program expects the contents of a tar file to be 819. Why
are the new files being created with CCSID 37? Because your QIBM_CCSID
is set to 37. But again, that's just the label -- it's not the fact.

But the IBM docs say that whatever CCSID is in QIBM_CCSID is what tar
will TRANSLATE the data to... and it's going to translate FROM 819.
Ouch. That means that a Java JAR file will certainly be corrupt --
because JAR is not text.

Granted, when the tarball was created, it translated from EBCDIC->ASCII,
and with your QIBM_CCSID set to 37, it should therefore try to translate
back from ASCII->EBCDIC, and therefore give you the same result you
started with. However, there's a flaw in that logic... there may be
characters in CCSID 37 that don't exist in 819. Likewise, there may be
characters in CCSID 819 that don't exist in 37. Therefore, you can't
assume that an EBCDIC->ASCII translation, followed by an ASCII->EBCDIC
translation will result in the same file at the end.

Worse, if you're extracting a tar file created on a non-IBM i system
(where the data wasn't originally run through a translation table when
the tarball was created) the results will be significantly different
from the input file.
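To see why a text translation is fine for text but destructive for binary data, here is a rough sketch. It does not call the real QShell tar or the real CCSID tables; it uses tr to emulate a tiny, hypothetical slice of an 819-to-37 table (only the bytes for h, e, l, o and the line terminator), and applies it to a made-up "binary" record that merely happens to contain some of those byte values.

```shell
#!/bin/sh
# Sketch only: emulate a few entries of an ASCII (819) -> EBCDIC (37)
# table with tr. A real conversion covers all 256 byte values.
LC_ALL=C; export LC_ALL     # make tr treat its input as plain bytes

to_ebcdic() {
    # h e l o \n  ->  0x88 0x85 0x93 0x96 0x25
    tr '\150\145\154\157\012' '\210\205\223\226\045'
}

# Byte-wise hex dump with whitespace squeezed.
hex() { od -An -tx1 | tr -s ' \n' ' '; }

# Text: translation is exactly what you want.
text_out=$(printf 'hello\n' | to_ebcdic | hex)

# "Binary": a 4-byte length field 0x0000010a plus two data bytes.
# Here 0x0a is part of a number, not a newline, but the table cannot
# know that and rewrites it anyway.
bin_in=$(printf '\000\000\001\012\150\377' | hex)
bin_out=$(printf '\000\000\001\012\150\377' | to_ebcdic | hex)

echo "text   ->$text_out"
echo "binary  :$bin_in"
echo "binary ->$bin_out"
```

The text comes out as valid EBCDIC, while the length field in the binary record silently changes value. A JAR or an executable run through a full translation table would suffer the same kind of damage throughout.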

At this point, I'd test to see if I understood the IBM docs correctly --
but frankly, this message is taking too long to write, and it's pretty
much a moot point, anyway.

The solution is not to use the TAR command in QShell, because it doesn't
do what you expect. It treats the data as text and translates it to
ASCII -- which is not what you want. Use the PASE tar utility, or get a
3rd party Unix archiver (such as 7zip) working in PASE... Or simply use
the JAR utility to archive things. In any case, it should solve your
problem.

On 8/25/2010 3:11 PM, Dennis Lovelady wrote:

Slight variation of this topic. I primarily work with systems in two
countries: Germany and USA. On each of these systems, my profile's CCSID
is set to 37. DSPJOB option 2 shows 37 for CCSID, 37 for Default CCSID.

When I go into QSH on either system and type a command like "touch
new_file" then a new file will be created with CCSID 37. So far so good.

But now I want to interchange files from one system with the other. The
files I want to interchange (from the German system) are all CCSID 819.
I use a command like "tar cvf mytar.tar path_to_files" to package these
files.

Interestingly the tarball is created with CCSID 819. (I don't know why
this would be, but I like it - I think.)

Now, I ftp the file to my PC. Because of security restrictions, in order
to get the file to the USA system, I have to make a pit stop at a local
server: FTP to intermediate server from PC. Log in to USA system and FTP
from intermediate to me. All FTP done in strict BINARY mode.

Now the tarball is CCSID 37 on the USA system. But I'm successful
(apparently) in untarring with tar xvf.

However, the untarred files are all CCSID 37. More importantly, they
don't operate as expected. For example, I usually do this with .JAR
files, and I invariably end up with complaints from JAVA about the
central directory being corrupted or some such. I've also tried certain
binaries (such as the AIX ZIP/UNZIP suite), and those don't run either.
The QSH command "setccsid" doesn't seem to help (although it does set
the CCSID of the file), and I'm quite confused as to why this would be.

Any pointers?

Dennis Lovelady
http://www.linkedin.com/in/dennislovelady
--
"A lawyer is a learned gentleman who rescues your estate from your
enemies... and keeps it for himself."
-- Henry Broughman
