Re: CSVR4 and UTF-8 -- RPG400-L

Greg, check my reply to Thomas' post just before yours - no promise of a solution, just some of what I discovered when facing this question here.

Vern

On 5/5/2020 9:17 AM, Greg Wilburn wrote:

So I changed the attributes of the file from 819 to 1208. It displays better in WRKLNK. The Right Single Quote is displayed as a blank instead of the other characters.

I'm wondering if this will help (at least some).

Rather than change the attribute on each file after I've downloaded, what if I added CHGJOB CCSID(1208) in the CL that calls my FTP script? Would that change the CCSID of the file received?

Sorry... this stuff is always very confusing to me.

-----Original Message-----
From: RPG400-L [mailto:rpg400-l-bounces@xxxxxxxxxxxxxxxxxx] On Behalf Of Thomas Raddatz
Sent: Tuesday, May 05, 2020 9:41 AM
To: RPG programming on IBM i <rpg400-l@xxxxxxxxxxxxxxxxxx>
Subject: Re: CSVR4 and UTF-8

Hi Vern,

I did not notice the drop down for switching between the different code planes of UTF-8. Hence I did not find the RIGHT SINGLE QUOTATION MARK earlier.

x'3F' is the same value I get for RIGHT SINGLE QUOTATION MARK when I call CSVDEMO for a CSV file that contains that character. So that is the same as SQL.

I do not think there is a nice solution for Greg. The only thing I have in mind is reading the CSV data as UTF-8 and then replacing the characters that are not available in EBCDIC. At the end, convert the resulting UTF-8 string to EBCDIC and continue with splitting the row into parts.
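
A minimal sketch of that sequence, in Python rather than RPG and only to illustrate the order of the steps. The replacement table, the function name `to_ebcdic_fields`, the choice of code page `cp037`, and the plain split on the tab delimiter are all assumptions for illustration; none of this is part of CSVR4.

```python
# Hypothetical replacement table: typographic characters that have no
# EBCDIC equivalent, mapped to plain characters that do.
REPLACEMENTS = str.maketrans({
    "\u2018": "'",    # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",    # RIGHT SINGLE QUOTATION MARK
    "\u201C": '"',    # LEFT DOUBLE QUOTATION MARK
    "\u201D": '"',    # RIGHT DOUBLE QUOTATION MARK
    "\u2026": "...",  # HORIZONTAL ELLIPSIS
    "\u2014": "-",    # EM DASH
})


def to_ebcdic_fields(raw: bytes, delimiter: str = "\t", codec: str = "cp037"):
    """Decode UTF-8, replace unmappable characters, convert, then split."""
    text = raw.decode("utf-8")                     # 1. read the row as UTF-8
    text = text.translate(REPLACEMENTS)            # 2. swap in plain equivalents
    ebcdic = text.encode(codec, errors="replace")  # 3. convert to EBCDIC
    return ebcdic.split(delimiter.encode(codec))   # 4. split the row into parts


row = "ABC123\tO\u2019Brien".encode("utf-8")
for field in to_ebcdic_fields(row):
    print(field.hex(" "), "=", field.decode("cp037"))
```

Anything still unmappable after step 2 becomes a question mark here, because that is what Python's `errors="replace"` does. A real CSV parser would also have to honour quoted fields, which the plain split ignores.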

Thomas

-----Original Message-----
From: RPG400-L [mailto:rpg400-l-bounces@xxxxxxxxxxxxxxxxxx] On Behalf Of Vernon Hamberg
Sent: Tuesday, May 5, 2020 08:59
To: RPG programming on IBM i
Subject: Re: CSVR4 and UTF-8

Thomas - take a look at U+2018, U+2019, U+201C and U+201D. U+2019
on that page - a great site - is RIGHT SINGLE QUOTATION MARK.

U+2018 ‘ e2 80 98 LEFT SINGLE QUOTATION MARK
U+2019 ’ e2 80 99 RIGHT SINGLE QUOTATION MARK
U+201A ‚ e2 80 9a SINGLE LOW-9 QUOTATION MARK
U+201B ‛ e2 80 9b SINGLE HIGH-REVERSED-9 QUOTATION MARK
U+201C “ e2 80 9c LEFT DOUBLE QUOTATION MARK
U+201D ” e2 80 9d RIGHT DOUBLE QUOTATION MARK
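
The byte values in this table follow directly from the UTF-8 encoding rules and can be reproduced with a few lines of Python. This is only a sketch for checking such tables; the helper name `utf8_row` is made up for illustration.

```python
import unicodedata


def utf8_row(code_point: int) -> str:
    """Format one row: code point, character, UTF-8 bytes in hex, Unicode name."""
    ch = chr(code_point)
    utf8_hex = ch.encode("utf-8").hex(" ")
    return f"U+{code_point:04X} {ch} {utf8_hex} {unicodedata.name(ch)}"


# The six quotation marks listed above, U+2018 through U+201D.
for cp in range(0x2018, 0x201E):
    print(utf8_row(cp))
```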

These are among several characters that don't exist in EBCDIC - others
include the ellipsis and the em dash. We ran into this challenge with text
entered on an iPhone app. These can also come from the Autocorrect options
in MS Word or Outlook.

I ended up using SQL on i to import the text, and it puts X'3F' where it
encounters characters it can't convert.
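
The substitution behaviour can be imitated outside of SQL to see what ends up in the data. The following Python sketch assumes code page `cp037` as the EBCDIC target and registers a hypothetical error handler named `ebcdic_sub`. It is not what Db2 for i does internally, only an imitation of the visible result.

```python
import codecs


def ebcdic_substitute(err):
    """Replace each unmappable character with the SUB control character.

    U+001A (SUBSTITUTE) is the character that cp037 encodes as X'3F'.
    """
    if isinstance(err, UnicodeEncodeError):
        return ("\x1a" * (err.end - err.start), err.end)
    raise err


codecs.register_error("ebcdic_sub", ebcdic_substitute)

text = "O\u2019Brien"  # contains RIGHT SINGLE QUOTATION MARK
ebcdic = text.encode("cp037", errors="ebcdic_sub")
print(ebcdic.hex(" "))  # the second byte is 3f
```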

Greg asks about recommendations - maybe the only one that retains the
characters would be to use 1208 (UTF-8) as the CCSID - of course, that
can't be done without a whole lot of work.

And IBM do not provide conversion tables between UTF-8 and EBCDIC - how
could you, at least in their present form.

Does iconv have options to convert these typographer (another descriptor
of these things) characters into something like EBCDIC?

Cheers
Vern

On 5/5/2020 1:02 AM, Thomas Raddatz wrote:

I do not know what you mean by 'right single quotation mark'. I assume it is an ACUTE ACCENT or a GRAVE ACCENT according to the UTF-8 table at https://www.utf8-chartable.de/.

I did a brief test with service program CSVR4 and the following test data on our IBM i:

"ABC123","Scott Klement","123 Sesame St","Milwaukee, WI","USA","","53132-1234",1000.00
"ABC123","Bärbel Böhm","Some Street","Some City","Germany","","40721",1000.00
"ABC123","`Jürgen` ´Bärbeißer´","Some Street","Some City","Germany","","40721",1000.00

The report produced by CSVDEMO shows the result expected:

File . . . . . : QSYSPRT
Control . . . . .
Find . . . . . .
*...+....1....+....2....+....3...
Acct       Name
---------- ---------------------
ABC123     Scott Klement
ABC123     Bärbel Böhm
ABC123     `Jürgen` ´Bärbeißer´

The German umlauts as well as the ACUTE ACCENT and GRAVE ACCENT are printed correctly. Hence I assume that CSVR4 works fine.

We do not use CSVR4, so a brief test is all I can do.

Did you check the CCSID of your input? Is it 1208 (= UTF-8)?

Thomas.

-----Original Message-----
From: RPG400-L [mailto:rpg400-l-bounces@xxxxxxxxxxxxxxxxxx] On Behalf Of Greg Wilburn
Sent: Monday, May 4, 2020 16:24
To: RPG400-L@xxxxxxxxxxxxxxxxxx
Subject: CSVR4 and UTF-8

I have a program that uses the CSVR4 service program to read tab-delimited text files that we pull down from a website. The site uses the UTF-8 character set... occasionally, we have issues with character translation.

Example: x'e2 80 99' (right single quotation mark) makes a real mess of the customer's name.
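
To illustrate why those three bytes turn into a mess: if they are interpreted as a single-byte code page such as CCSID 819 (ISO 8859-1), each byte becomes its own character. This Python sketch uses a made-up customer name and assumes the file is tagged 819 when it is read.

```python
raw = "O\u2019Brien".encode("utf-8")  # bytes as delivered by the website
print(raw.hex(" "))                   # the quote occupies three bytes: e2 80 99

as_819 = raw.decode("latin-1")        # read as CCSID 819 (ISO 8859-1)
as_1208 = raw.decode("utf-8")         # read as CCSID 1208 (UTF-8)

# Under 819 the quote becomes an a-circumflex plus two control characters.
print(repr(as_819), len(as_819))
print(repr(as_1208), len(as_1208))
```

The name grows from 7 to 9 characters under 819, and two of the extra characters are non-display control codes.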

I have a utility that removes non-display characters, but in this case I need to keep the character.

Any recommendations on changes that could be made to the process that would eliminate some of these translation issues?

Thanks,
Greg