Re: Importing UTF-8 multibytes file -- MIDRANGE-L

+1 to this

UTF-8 can use 1, 2, 3, or 4 bytes to represent a character, even emojis - there's no way to copy it back and forth between 1208 and 65535 or 37 or whatever EBCDIC you use in your country.
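
To make the variable length concrete, here is a small Python sketch (not IBM i code; the sample characters are my own choice for illustration) that prints how many bytes UTF-8 needs for a few characters. Note that "Š" comes out as C5A0, the same bytes Maria quotes below.

```python
# Characters picked for illustration: ASCII, Latin-1, Latin Extended-A,
# the euro sign, and an emoji outside the Basic Multilingual Plane.
samples = ["A", "é", "Š", "€", "😀"]


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 encoding of one character as uppercase hex."""
    return ch.encode("utf-8").hex().upper()


for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {utf8_hex(ch)}")
```

A single-byte EBCDIC code page has at most 256 code points, so most of these characters have nowhere to go.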

Even with 65535, I believe everything is handled using single bytes, so it won't work for copying both ways. It would be interesting to see, of course, whether the raw bytes are copied over. But presenting the data out of 65535, IF all the bytes are copied, would be impossible, due to the different lengths of the character representation. Or do you want to write your own program to process UTF-8? I'm smiling as I suggest that - no way I want to.

There are things, like emojis, that will not convert into any of the EBCDIC character sets. We use SQL's XML/JSON functions to import the data - anything it doesn't recognize is converted to X'3F', I believe. Does RPG recognize UTF-8 now? I haven't checked for a while. It has recognized UTF-16 and UCS-2.
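
The substitution can be imitated off-platform with a rough Python sketch. The assumptions are mine: Python's `cp037` codec stands in for CCSID 37, and the helper `to_ebcdic` is hypothetical, written only to mimic replacing an unconvertible character with X'3F'. It is not how the OS conversion is implemented.

```python
SUB = b"\x3f"  # EBCDIC substitute (SUB) control character


def to_ebcdic(text: str, codec: str = "cp037") -> bytes:
    """Encode text to EBCDIC, replacing unmappable characters with X'3F'."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(codec)
        except UnicodeEncodeError:
            out += SUB  # character does not exist in this code page
    return bytes(out)


original = "Š".encode("utf-8")                  # the two bytes C5 A0
ebcdic = to_ebcdic(original.decode("utf-8"))    # collapses to one byte
back = ebcdic.decode("cp037").encode("utf-8")   # convert back to UTF-8

print(original.hex(), ebcdic.hex(), back.hex())
```

Under these assumptions the two-byte character becomes 3F going in and 1A coming back out (1A is the ASCII/Unicode SUB character), which matches what Maria describes. Once the substitution has happened, the original character cannot be recovered.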

Cheers
Vern

On 2/4/2022 8:12 AM, Stephen Landess wrote:

Maria -

I spent the last 15 years working in a multinational JDE shop with 54 different environments comprising most of the major languages and character sets in the world.

I was surprised how little information was available about character conversions in forums such as Midrange-L and Midrange-RPG when I first started working there in 2006. I finally found Scott Klement's web site and found a wealth of information from him.

If you have multinational character set data in the IFS file (i.e., data from different countries with varying character sets), then the safest way to handle it is to create a new file on the IBM i with CCSID(1208) {UTF-8}, CCSID(1200) {UTF-16}, or CCSID(13488) {UCS-2}. Then use CPYFRMIMPF to copy the data from the IFS file to the IBM i file, specifying the appropriate from and to CCSIDs, and use the data in the new file in your applications. This may require using *UCS fields in RPG programs.

If the IFS file is data from a particular country, then the CCSID 1208 data can be converted to EBCDIC into your current file by using the appropriate EBCDIC CCSID as the TOCCSID() in CPYFRMIMPF, and the OS will convert the data from 1208 to EBCDIC. However, when the file is defined using CCSID(65535) you'll need to set the job CCSID to match the EBCDIC CCSID of the data in order to use it...
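
As an off-platform illustration of why the CCSIDs have to match, here is a Python sketch. The assumptions are mine: Python's `cp273` and `cp037` codecs stand in for the German and US EBCDIC CCSIDs, and the sample address is made up. It says nothing about how CPYFRMIMPF works internally.

```python
# A made-up German address; every character exists in the German EBCDIC page.
text = "Müller, Hauptstraße 5"

# Convert from Unicode to the country's EBCDIC code page: no data is lost.
ebcdic_273 = text.encode("cp273")
round_trip = ebcdic_273.decode("cp273")

# Read the same bytes with a different EBCDIC code page: the national
# characters sit at different code points, so the text comes out wrong.
misread = ebcdic_273.decode("cp037")

print(round_trip == text)  # data survives when the code pages match
print(misread)             # garbled where the two code pages differ
```

This is roughly the situation with a CCSID(65535) file: the bytes are stored untagged, so they only read back correctly when the job CCSID matches the code page they were written in.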

Feel free to call me if you need further information.

Regards,
Steve Landess
512-289-0387

Maria wrote:

Hi all,
Hope you are all fine!
For a new customer of mine, I need to import into IBM i running V7R3 a
UTF-8 multibyte file. The UTF-8 multibyte file is on the IFS and has its CCSID set to 1208, whereas the DB2 flat file is set to CCSID = 65535.
The UTF-8 file is multibyte because it may contain worldwide addresses.

No matter what command I use, be it FTP or CPYFRMIMPF,
the hexadecimal value stored for a multibyte character (e.g. C5A0)
is just one byte (in our example it is 3F);
right after, if I FTP the same DB2 flat file back to the IFS, using TYPE C 1208,
the resulting file is different from the original one:
where the original file has a multibyte character (e.g. C5A0), there is now a single byte (1A).
In your opinion and experience, is there a way to import and then export a
UTF-8 file to and from the Power system so that the resulting file is the same as the original one?
Am I really obliged to read the UTF-8 file from the IFS byte by byte and do some sort of conversion myself?
I already searched the mailing list without success, so I resolved to ask, because I am pretty sure this is a common issue.