RE: CSVR4 and UTF-8 -- RPG400-L

Scott,

Wow, I'm not sure how to thank you... I think I speak for our entire community when I say "Thank You".

I'll give this a try today.

Thx,
Greg

-----Original Message-----
From: RPG400-L [mailto:rpg400-l-bounces@xxxxxxxxxxxxxxxxxx] On Behalf Of Scott Klement
Sent: Tuesday, May 05, 2020 11:54 PM
To: rpg400-l@xxxxxxxxxxxxxxxxxx
Subject: Re: CSVR4 and UTF-8

Hi Greg,

I've updated the CSVR4 utility on my web site:

http://www.scottklement.com/csv/

(Be careful that your browser hasn't cached the old copy when you
re-download it.)

The CSV_open() function now has an additional parameter that lets you
tell CSVR4 to do its internal processing using Unicode fields. You
would call it like this:

file = CSV_open('/path/to/ifs.csv': *omit:*omit:*omit: *on);

Passing *ON in the final parameter enables the internal Unicode support.
With that enabled, CSVR4 converts your file (assuming it is properly
labelled as CCSID 1208) into CCSID 1200 (UTF-16 Unicode) for internal
processing.
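
As a rough illustration of what a 1208-to-1200 conversion does to the bytes, here is a sketch in Python rather than RPG, purely so it can be run anywhere. It does not show CSVR4's internals, which may work differently. The helper name utf8_to_utf16be and the sample name are made up for the example.

```python
# Sketch only: mimics a CCSID 1208 (UTF-8) -> CCSID 1200 (UTF-16) conversion.
# Python's "utf-16-be" codec stands in for CCSID 1200 (big-endian UTF-16).

def utf8_to_utf16be(raw: bytes) -> bytes:
    """Decode UTF-8 bytes and re-encode them as big-endian UTF-16."""
    return raw.decode("utf-8").encode("utf-16-be")

# "O'Brien" typed with a RIGHT SINGLE QUOTATION MARK (x'E2 80 99' in UTF-8)
sample = b"O\xe2\x80\x99Brien"
print(utf8_to_utf16be(sample).hex(" "))
```

The three UTF-8 bytes of the quote collapse into the single 2-byte unit x'2019', which is the value a UCS-2 field would hold for that character.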

There is also a new function called CSV_getFldUni() which returns a
Unicode field (vs. CSV_getFld which returns EBCDIC).   You would call it
like this:

D name            s             50c   varying ccsid(1200)
   .
   .
      dow CSV_loadRec(file);

         CSV_getFldUni(file: name: %size(name));
         .
         . ...other fields here, etc...
         .
      enddo;

This should be easy to understand if you're already familiar with
CSVR4. The only difference between CSV_getFld and CSV_getFldUni is that
the latter returns the data into an RPG CCSID(1200) type 'C' (VARUCS2)
field.

Since the field is now in Unicode, your code should run without errors.
Then you could use %XLATE to convert the curly quotes to straight quotes
-- which do exist in EBCDIC -- to solve your problem.
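
The replace-then-convert idea can be sketched as follows. This is Python, not RPG: str.translate plays the role of %XLATE, and the "cp037" codec stands in for an EBCDIC job CCSID, which is an assumption (your CCSID may differ). The names CURLY_TO_STRAIGHT and to_ebcdic are invented for the example.

```python
# Sketch only: str.translate plays the role of RPG's %XLATE, and Python's
# "cp037" codec stands in for an EBCDIC job CCSID (an assumption).

CURLY_TO_STRAIGHT = str.maketrans({
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
})

def to_ebcdic(text: str) -> bytes:
    """Replace curly quotes with straight ones, then encode as EBCDIC (cp037)."""
    return text.translate(CURLY_TO_STRAIGHT).encode("cp037")

name = "O\u2019Brien"
print(to_ebcdic(name).hex(" "))
```

Encoding the untranslated name straight to cp037 raises an error, because U+2019 has no EBCDIC equivalent there; after the translation step the conversion goes through.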

There's a full example named CSVDEMO4 included in the download that
shows reading the file in Unicode, replacing the quotes, and converting
the fixed-up field into EBCDIC so that it can be printed.

Good luck!

-SK

On 5/5/2020 3:50 PM, Greg Wilburn wrote:

Ok... I don't pretend to understand this, but...

The file is UTF-8 when downloaded to a PC via the browser and viewed in Notepad++.
On the IBM i, I'm downloading this with a CLP that overrides INPUT from a source member containing the script, then issues the FTP RMTSYS(server ip). The file is assigned CCSID 819. I don't know how to change that.

If I do a CHGATR OBJ(myorder.txt) ATR(*CCSID) VALUE(1208), the file "looks better" when viewed with WRKLNK. The hex (E2 80 99) is still there, but is displayed as a single blank space.

But when I use CSV_open() on that file, I receive CPE3490
Message . . . . : Conversion error.
Cause . . . . . : One or more characters could not be converted from the
source CCSID to the target CCSID.
Technical description . . . . . . . . : Change the input string to only
contain valid characters for the specified source CCSID. If this message is
coming from an application, report the message to the service provider for
your application.

The CVT101 utility isn't working, or I'm not using it correctly.

This has gone way further than I anticipated... We use these addresses to ship an order, then delete them overnight. FWIW, I'm pretty sure the shipping software strips out everything except alphanumeric anyway.

BTW - the web platform is Shopify if anyone cares.

-----Original Message-----
From: RPG400-L [mailto:rpg400-l-bounces@xxxxxxxxxxxxxxxxxx] On Behalf Of Scott Klement
Sent: Tuesday, May 05, 2020 3:50 PM
To: rpg400-l@xxxxxxxxxxxxxxxxxx
Subject: Re: CSVR4 and UTF-8

Greg,

Yeah, for sure it'd be easier if they made sure the people keying the
stuff in did it according to some sort of standard for quality.

I can't help you with the fact that your file transfer process is
marking the CCSID wrong. (Actually, it's probably not marking it at all,
but taking the default, and the default for your system is probably 819
-- which is ISO-8859-1, an old flavor of ASCII.)
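
To see why the wrong CCSID tag matters, here is a small Python sketch (not RPG). It assumes Python's "latin-1" codec as a stand-in for CCSID 819 and "utf-8" for CCSID 1208; the sample name is made up.

```python
# Sketch only: "latin-1" stands in for CCSID 819 (ISO-8859-1) and
# "utf-8" for CCSID 1208.

raw = b"O\xe2\x80\x99Brien"   # UTF-8 bytes containing a RIGHT SINGLE QUOTATION MARK

as_819 = raw.decode("latin-1")   # every byte becomes its own character
as_1208 = raw.decode("utf-8")    # the three quote bytes become one character

print(len(as_819), len(as_1208))
print([hex(ord(c)) for c in as_819[1:4]])
```

Read as 819, the single quote turns into three characters (an a-circumflex followed by two control characters), which is the kind of mess that shows up in a customer's name. Read as 1208, it stays one character.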

And there's nothing I can do about the fact that EBCDIC doesn't support
the "fancy quote" characters that people use. (Honestly, these
characters aren't on the keyboard -- so the only reason people use them
is because companies like Microsoft think it's helpful to automatically
convert the quote marks to the "fancy" ones.)

But, what I could do is allow CSVR4 to return data in a different
character set besides just EBCDIC.

If you can get the output in UTF-16 (i.e. a UCS-2 type RPG field with
CCSID 1200) then you could at least do something to work with the
unexpected characters before putting them into your EBCDIC fields. If
I don't make a change like that, then how else could you handle it? I
suppose you could switch to using some other software, but... from my
perspective, that's not good, it makes someone else's software more
useful than mine... not what I want.

-SK

On 5/5/2020 2:36 PM, Greg Wilburn wrote:

Scott,

Yes, I'm using your CSVR4 tool to read IFS files containing orders that I'm retrieving from an FTP server. The data files are generated on a popular web platform. I'm definitely not looking for you to do extra work on my behalf.

I just made a "donation" to easy400 to download CVT101 to try. It seems like this utility might do the trick. It's a service pgm that uses iconv to convert from one code page to another.

I also considered creating a database file and using CPYFRMIMPF - although that didn't seem like it did much for me. In a test, I changed the file attributes to CCSID 1208 and did the CPYFRMIMPF. It did get rid of the goofy characters.

To be honest, I am really disappointed in the quality of data coming from a web platform of this caliber. These guys do NO formatting or address validation of any kind. They allow people to key anything into their own shipping address. IMHO this is inexcusable given the tools available via web services today.
(rant over).

Thanks,
Greg

-----Original Message-----
From: RPG400-L [mailto:rpg400-l-bounces@xxxxxxxxxxxxxxxxxx] On Behalf Of Scott Klement
Sent: Tuesday, May 05, 2020 3:05 PM
To: rpg400-l@xxxxxxxxxxxxxxxxxx
Subject: Re: CSVR4 and UTF-8

Thomas,

If your goal is to have fgets() return data in UCS-2, then you would not
be able to use %STR, because %STR is not expecting UCS-2.

Instead, you could manually scan the buffer (via %SCAN) for the x'0000'
terminator, and then use something like memcpy or %SUBST (with a deref
pointer to UCS2 field) to copy the appropriate length into an RPG
VARUCS2 field.
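
A rough sketch of that terminator scan, in Python rather than RPG so it can be run anywhere. The function name ucs2_strlen and the sample buffer are made up for illustration. In a byte-oriented language the scan has to step in 2-byte units, otherwise it can match a x'0000' that straddles two adjacent characters.

```python
# Sketch only: find the end of a NUL-terminated UCS-2/UTF-16BE string in a
# fixed-size buffer, stepping two bytes at a time so the scan never matches
# a x'0000' that straddles two adjacent characters.

def ucs2_strlen(buf: bytes) -> int:
    """Return the byte length of the data before the first aligned x'0000'."""
    for pos in range(0, len(buf) - 1, 2):
        if buf[pos:pos + 2] == b"\x00\x00":
            return pos
    # No terminator found: treat the whole (even-length) buffer as data.
    return len(buf) - len(buf) % 2

# U+0100 followed by "A": bytes 01 00 00 41 contain an unaligned x'0000',
# then the real terminator, then blank padding.
buf = b"\x01\x00\x00\x41\x00\x00" + b"\x40" * 10
text = buf[:ucs2_strlen(buf)].decode("utf-16-be")
print(len(text))
```

A plain byte search for x'0000' would stop at offset 1 in this buffer and cut the first character in half; the aligned scan stops at offset 4.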

The OP would still run into the same problem when converting the UCS-2
field to EBCDIC, but he'd be able to do scan/replace prior to that in
his own code.

Sounds like he might be using my CSVR4 tool, so maybe I should add UCS-2
support...

-SK

On 5/5/2020 1:22 PM, Tools/400 wrote:

Hi Vern,

One thing that could actually work is reading the data in ccsid UCS-2
(13488). That covers almost all characters (or maybe even all
characters) of UTF-8. Then Greg could replace e.g. the SINGLE
QUOTATION MARKS by a character that can be translated to EBCDIC, for
example a SINGLE QUOTE ('). He could do the same for other characters
that have an almost equal counterpart in EBCDIC.

Once that is done, he could translate the UCS-2 string to EBCDIC.
Basically that worked for me (quick and dirty test). But there is a
problem I do not understand. The key is that %str() does not work for
determining the length of the buffer returned by fgets().

Here are some examples returned by fgets():

EOL character: CRLF, o_CCSID=13488
00000     FFDE0022 00410042 00430031 00320033   - █ú... .â.{......
00010     0022002C 00222019 004200E4 00720062   - .........â.U.Ê.Â
00020     0065006C 20190020 004200F6 0068006D   - .Á.%.....â.6.Ç._
00030     0022002C 00220053 006F006D 00650020   - .......ë.?._.Á..
00040     00530074 00720065 00650074 0022002C   - .ë.È.Ê.Á.Á.È....
00050     00220053 006F006D 00650020 00430069   - ...ë.?._.Á...{.Ñ
00060     00740079 0022002C 00220047 00650072   - .È.`.......å.Á.Ê
00070     006D0061 006E0079 0022002C 00220022   - ._./.>.`........
00080     002C0022 00340030 00370032 00310022   - ................
00090     002C0031 00300030 0030002E 00300030   - ................
000A0     00000A00 40404040 40404040 40404040   - ....
          -- where does this byte belong to???
             and why is LF = x'000A'?

EOL character: LF, o_CCSID=13488
00000     FFDE0022 00410042 00430031 00320033   - █ú... .â.{......
00010     0022002C 00222019 004200E4 00720062   - .........â.U.Ê.Â
00020     0065006C 20190020 004200F6 0068006D   - .Á.%.....â.6.Ç._
00030     0022002C 00220053 006F006D 00650020   - .......ë.?._.Á..
00040     00530074 00720065 00650074 0022002C   - .ë.È.Ê.Á.Á.È....
00050     00220053 006F006D 00650020 00430069   - ...ë.?._.Á...{.Ñ
00060     00740079 0022002C 00220047 00650072   - .È.`.......å.Á.Ê
00070     006D0061 006E0079 0022002C 00220022   - ._./.>.`........
00080     002C0022 00340030 00370032 00310022   - ................
00090     002C0031 00300030 0030002E 00300030   - ................
000A0     000A0040 40404040 40404040 40404040   - ...
          ---- why is LF = x'000A'?

EOL character: CRLF, o_CCSID=0
00000     FFDE7FC1 C2C3F1F2 F37F6B7F 3FC2C099   - █ú"ABC123",".Bär
00010     8285933F 40C26A88 947F6B7F E2969485   - bel. Böhm","Some
00020     40E2A399 8585A37F 6B7FE296 948540C3   - Street","Some C
00030     89A3A87F 6B7FC785 99948195 A87F6B7F   - ity","Germany","
00040     7F6B7FF4 F0F7F2F1 7F6BF1F0 F0F04BF0   - ","40721",1000.0
00050     F0250040 40404040 40404040 40404040   - 0..
            -- that is fine. LF = x'25' in EBCDIC

I do not understand the last bytes of example 1 and 2. CRLF in UCS-2
should be x'240D' (CR) and x'240A' (LF), shouldn't it?

How could we get the length of the buffer returned by fgets()?

Thomas.

On 05.05.2020 at 18:07, Vernon Hamberg wrote:

I agree, Thomas.

UTF-8 has distinct byte patterns that identify them as UTF-8 - single
bytes go up to 7F, that's the Basic Latin group on that site.

One can scan through a UTF-8 text. Whenever you encounter a byte
greater than or equal to X'80', you are inside a multi-byte UTF-8
character - some are 2 bytes, some 3, some 4. Those of 2 bytes start
with x'C?' or x'D?', those of 3 bytes with x'E?', those of 4 with x'F?'.
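
The byte patterns can be sketched like this (Python, not RPG, so it can be run anywhere; the function names are made up for the example):

```python
# Sketch only: classify a byte of UTF-8 text by its value.

def utf8_sequence_length(lead: int) -> int:
    """Return the sequence length a lead byte announces, or 0 if the byte
    is a continuation byte or not a valid lead byte."""
    if lead < 0x80:
        return 1          # 0xxxxxxx: single byte (Basic Latin)
    if 0xC2 <= lead <= 0xDF:
        return 2          # 110xxxxx
    if 0xE0 <= lead <= 0xEF:
        return 3          # 1110xxxx
    if 0xF0 <= lead <= 0xF4:
        return 4          # 11110xxx
    return 0              # continuation byte (x'80'-x'BF') or invalid

def count_characters(raw: bytes) -> int:
    """Count characters in well-formed UTF-8 by counting lead bytes."""
    return sum(1 for b in raw if utf8_sequence_length(b) > 0)

print(utf8_sequence_length(0xE2), count_characters(b"O\xe2\x80\x99Brien"))
```

The x'E2' that starts the right single quotation mark announces a 3-byte sequence, and the two bytes after it (x'80' and x'99') are continuation bytes.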

Greg mentions that he changed the file CCSID to 1208, and WRKLNK
handles it better - I saw that here when faced with emojis from
people's iPhones. Our solution at the time - 3 years ago - was to use
SQL's XMLTABLE - it knows what it doesn't know and gives you a
placeholder - one of our developers looks like he wants to use
substitute strings - I don't know how you keep up with that, but
maybe he can. UTF-8 is such a moving target in some respects,
especially emoji-like things. Do you replace the MOUSE FACE with the
words in the message?

Regards
Vern

On 5/5/2020 8:40 AM, Thomas Raddatz wrote:

Hi Vern,

I did not notice the drop down for switching between the different
code planes of UTF-8. Hence I did not find the RIGHT SINGLE
QUOTATION MARK earlier.

x'3F' is the same that I get for RIGHT SINGLE QUOTATION MARK when I
call CSVDEMO for a CSV file that contains that character. So that is
the same as SQL.

I do not think there is a nice solution for Greg. The only thing I
have in mind is reading the CSV data as UTF8 and then replacing
characters not available in EBCDIC. At the end, convert the resulting
UTF8 string to EBCDIC and continue with splitting the row into parts.

Thomas

-----Original Message-----
From: RPG400-L [mailto:rpg400-l-bounces@xxxxxxxxxxxxxxxxxx] On Behalf
Of Vernon Hamberg
Sent: Tuesday, May 5, 2020 08:59
To: RPG programming on IBM i
Subject: Re: CSVR4 and UTF-8

Thomas - take a look at U+2018 and U+2019 and U+201C and U+201D. U+2019
in that page - a great site - is RIGHT SINGLE QUOTATION MARK.

U+2018     ‘     e2 80 98     LEFT SINGLE QUOTATION MARK
U+2019     ’     e2 80 99     RIGHT SINGLE QUOTATION MARK
U+201A     ‚     e2 80 9a     SINGLE LOW-9 QUOTATION MARK
U+201B     ‛     e2 80 9b     SINGLE HIGH-REVERSED-9 QUOTATION MARK
U+201C     “     e2 80 9c     LEFT DOUBLE QUOTATION MARK
U+201D     ”     e2 80 9d     RIGHT DOUBLE QUOTATION MARK

These are among several characters that don't exist in EBCDIC - the
ellipsis and the em dash are others. We had run into this challenge
with text entered on an iPhone app. And these can come from the
Autocorrect options in MS Word or Outlook.

I ended up using SQL on i to import the text, and it puts X'3F' where
it encounters characters it can't convert.

Greg asks about recommendations - maybe the only one that retains the
characters would be to use 1208 (UTF-8) as the CCSID - of course, that
can't be done without a whole lot of work.

And IBM do not provide conversion tables between UTF-8 and EBCDIC - how
could you, at least in their present form.

Does iconv have options to convert these typographer (another
descriptor of these things) characters into something like EBCDIC?

Cheers
Vern

On 5/5/2020 1:02 AM, Thomas Raddatz wrote:

I do not know what you mean by ' right single quotation mark '. I
assume it is an ACUTE ACCENT or a GRAVE ACCENT according to the UTF8
table https://www.utf8-chartable.de/.

I did a brief test with service program CSVR4 and the following
test data on our IBM i:

"ABC123","Scott Klement","123 Sesame St","Milwaukee, WI","USA","","53132-1234",1000.00
"ABC123","Bärbel Böhm","Some Street","Some City","Germany","","40721",1000.00
"ABC123","`Jürgen` ´Bärbeißer´","Some Street","Some City","Germany","","40721",1000.00

The report produced by CSVDEMO shows the result expected:

File . . . . . :   QSYSPRT
Control . . . . .
Find . . . . . .
*...+....1....+....2....+....3...
Acct        Name
---------- ---------------------
ABC123      Scott Klement
ABC123      Bärbel Böhm
ABC123      `Jürgen` ´Bärbeißer´

The German Umlaute as well as the ACUTE ACCENT and GRAVE ACCENT are
correctly printed. Hence I assume that CSVR4 works fine.

We do not use CSVR4, so a brief test is all I can do.

Did you check the CCSID of your input? Is it 1208 (= UTF8)?

Thomas.

-----Original Message-----
From: RPG400-L [mailto:rpg400-l-bounces@xxxxxxxxxxxxxxxxxx] On Behalf
Of Greg Wilburn
Sent: Monday, May 4, 2020 16:24
To: RPG400-L@xxxxxxxxxxxxxxxxxx
Subject: CSVR4 and UTF-8

I have a program that is using the CSVR4 service programs to read
tab delimited text files that we pull down from a website. The
site is using UTF-8 character set... occasionally, we have issues
with character translation.

Example: x'e2 80 99' (right single quotation mark) makes a real
mess of the customer's name.

I have a utility that removes non-display characters, but in this
case I need to keep the character.

Any recommendations on changes that could be made to the process
that would eliminate some of these translation issues?

Thanks,
Greg