Re: CSVR4 and UTF-8 -- RPG400-L

Yep, it only gets you closer when reading it - your 'RIGHT SINGLE QUOTATION' would still not be handled correctly.

And your %xlate call might corrupt existing UTF-8 entities, if any of the bytes in the UTF-8 character's byte-string are in the range x'00' to x'3F' or equal to x'41'.
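
To make the risk concrete, here is a minimal Python sketch that emulates the %XLATE table from this thread (x'00' to x'3F' plus x'41', all mapped to x'40') and applies it to raw UTF-8 bytes instead of EBCDIC data. It is an emulation, not the RPG program itself, and the sample value is hypothetical. Under those assumptions the three-byte quotation mark passes through, because its bytes are all x'80' or higher, but ASCII punctuation, blanks, digits and the letter 'A' are overwritten with x'40', which is '@' in ASCII.

```python
# Emulate the thread's %XLATE table on raw UTF-8 bytes (not on EBCDIC data).
# NONDSP covers x'00'..x'3F' plus x'41'; every listed byte becomes x'40',
# which is a blank in EBCDIC but '@' in ASCII/UTF-8.
NONDSP = bytes(range(0x00, 0x40)) + b"\x41"
TABLE = bytes.maketrans(NONDSP, b"\x40" * len(NONDSP))

raw = "O\u2019Brien, Apt 12".encode("utf-8")   # hypothetical sample value
out = raw.translate(TABLE)

print(raw.hex(" "))          # bytes before translation
print(out.hex(" "))          # bytes after translation
print(out.decode("utf-8"))   # still valid UTF-8, but the ASCII part is damaged
```

So the table is only safe once the data really is EBCDIC.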

Regards
Vern

On 5/5/2020 11:00 AM, Greg Wilburn wrote:

Changing the attribute didn't work... I changed it to 1208 and the parsing portion of my program failed.

-----Original Message-----
From: RPG400-L [mailto:rpg400-l-bounces@xxxxxxxxxxxxxxxxxx] On Behalf Of Vernon Hamberg
Sent: Tuesday, May 05, 2020 10:51 AM
To: RPG programming on IBM i <rpg400-l@xxxxxxxxxxxxxxxxxx>
Subject: Re: CSVR4 and UTF-8

Seems CHGATR is the command to use - but the original file is in UTF-8,
according to Greg below. So the VALUE here should probably be 1208, if I
understand this a little.
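
A short Python sketch of why the tag matters, assuming the attribute change only relabels the file and leaves the bytes alone. The codec names are stand-ins chosen for illustration: `latin-1` for CCSID 819 and `utf-8` for CCSID 1208. The same three bytes from Greg's example are read both ways.

```python
# The same three bytes read under two different CCSID tags.
# Python codec names stand in for CCSIDs: latin-1 ~ 819, utf-8 ~ 1208.
data = bytes.fromhex("e2 80 99")

as_819 = data.decode("latin-1")    # one character per byte
as_1208 = data.decode("utf-8")     # one three-byte character

print([f"U+{ord(c):04X}" for c in as_819])
print([f"U+{ord(c):04X}" for c in as_1208])
```

Read as 819, the bytes become three unrelated characters; read as 1208, they are the single character U+2019.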

On a slightly different question - I don't remember whether UTF-8 byte
strings use any of the hex values in NONDSP here - they might, since
UTF-8 might use bytes from X'20' on.
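
One way to settle the question is to check it by brute force. The sketch below, in Python and not tied to any IBM i API, encodes every non-ASCII code point as UTF-8 and collects the byte values that occur, then compares them with the NONDSP values from this thread.

```python
# Check which byte values multi-byte UTF-8 sequences can contain.
NONDSP = set(range(0x00, 0x40)) | {0x41}   # byte values in the thread's table

seen = set()
for cp in range(0x80, 0x110000):           # every non-ASCII code point
    if 0xD800 <= cp <= 0xDFFF:             # surrogates have no UTF-8 form
        continue
    seen.update(chr(cp).encode("utf-8"))

print(f"lowest byte: {min(seen):#04x}, highest byte: {max(seen):#04x}")
print("overlap with NONDSP:", sorted(seen & NONDSP))
```

Bytes below x'80' in UTF-8 data are always single-byte ASCII characters, so the NONDSP values can only hit those (control characters, blank, punctuation, digits and 'A'), never part of a multi-byte sequence.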

Using %xlate assumes the content of the IFS file is in EBCDIC, right?

Enough for now! I'm very interested in this topic, as we are facing
Unicode issues here.

Vern

On 5/5/2020 8:36 AM, Eiresoft Software Solutions wrote:

Hi, not sure if this helps, but have you tried converting 819 to 273:

CHGATR OBJ('/tmp/test.csv') ATR(*CCSID) VALUE(273)

regards

On 2020-05-05 15:05, Greg Wilburn wrote:

Vern,

That is exactly correct... the CCSID of the file is 819. I'm not sure
how that is assigned, other than from the CCSID of the job that
retrieves the file via FTP (I have a CLP that runs a script saved in
DDS text).

These are orders coming from a very popular website platform... an app
installed on the platform extracts orders and pushes them to our FTP
server for pickup. The app has a function called normalize_text() to
remove characters. Unfortunately, you can't apply it to the entire
tab-delimited string. It has to be applied to each-and-every field
we've selected for download.

EASY400 has a utility called CVT101 that contains StmfCvt to convert
from one to another, but that likely won't help.

I was hoping for an "easier" solution.

My service program (someone helped me with this too) uses %XLATE
against a constant that contains the hex representation of many
non-display characters.

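       // Byte values x'00' through x'3F' plus x'41'. The '+' continuation
       // resumes at the first non-blank character of the next line.
       dcl-c nondsp const(x'000102030405060708090A0B0C0D0E0F+
                            101112131415161718191A1B1C1D1E1F+
                            202122232425262728292A2B2C2D2E2F+
                            303132333435363738393A3B3C3D3E3F+
                            41');
       // Same length as nondsp; an uninitialized char field holds blanks.
       dcl-s space                char(%size(nondsp));
       dcl-s result               varchar(1024);

         result = %trim(inchar);
         // Replace every byte listed in nondsp with a blank.
         result = %xlate(nondsp:space:result);

         return result;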

I think I also tried the SQL Translate() function, but RPG seemed a
bit faster.

Greg

-----Original Message-----
From: RPG400-L [mailto:rpg400-l-bounces@xxxxxxxxxxxxxxxxxx] On Behalf
Of Vernon Hamberg
Sent: Tuesday, May 05, 2020 2:59 AM
To: RPG programming on IBM i <rpg400-l@xxxxxxxxxxxxxxxxxx>
Subject: Re: CSVR4 and UTF-8

Thomas - take a look at U+2018 and U+2019 and U+201C and U+201D. U+2019
in that page - a great site - is RIGHT SINGLE QUOTATION MARK.

U+2018     ‘     e2 80 98     LEFT SINGLE QUOTATION MARK
U+2019     ’     e2 80 99     RIGHT SINGLE QUOTATION MARK
U+201A     ‚     e2 80 9a     SINGLE LOW-9 QUOTATION MARK
U+201B     ‛     e2 80 9b     SINGLE HIGH-REVERSED-9 QUOTATION MARK
U+201C     “     e2 80 9c     LEFT DOUBLE QUOTATION MARK
U+201D     ”     e2 80 9d     RIGHT DOUBLE QUOTATION MARK

These are among several characters that don't exist in EBCDIC - the
ellipsis and the em dash are others. We had run into this challenge with
text entered on an iPhone app. And these can come from the Autocorrect
options in MS Word or Outlook.

I ended up using SQL on i to import the text, and it puts X'3F' where it
encounters characters it can't convert.
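
The substitution behaviour described above can be emulated in a few lines of Python. This is a sketch only, not how SQL on i does it internally: `to_ebcdic` is a hypothetical helper, `cp037` is assumed as a stand-in for the target EBCDIC CCSID, and the sample name is made up.

```python
def to_ebcdic(text: str, codec: str = "cp037", sub: int = 0x3F) -> bytes:
    """Encode text to an EBCDIC code page, writing `sub` for unmappable characters."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(codec)
        except UnicodeEncodeError:
            # The character has no slot in the target code page.
            out.append(sub)
    return bytes(out)

name = "O\u2019Brien"                 # hypothetical sample value
converted = to_ebcdic(name)
print(converted.hex(" "))
```

The second byte of the result is x'3F', and the original character cannot be recovered from it afterwards.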

Greg asks about recommendations - maybe the only one that retains the
characters would be to use 1208 (UTF-8) as the CCSID - of course, that
can't be done without a whole lot of work.

And IBM does not provide conversion tables between UTF-8 and EBCDIC that
cover these characters - how could they, at least in their present form?

Does iconv have options to convert these typographer (another descriptor
of these things) characters into something like EBCDIC?
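
Whether iconv offers such an option is left open here. As a manual alternative, the sketch below folds the typographic characters to plain equivalents before converting. The `FOLD` table and the `fold_then_convert` helper are hypothetical names, the choice of replacements is an assumption, and `cp037` again stands in for the target EBCDIC CCSID.

```python
# Hypothetical fold table: typographic characters -> plain equivalents.
FOLD = str.maketrans({
    "\u2018": "'",    # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",    # RIGHT SINGLE QUOTATION MARK
    "\u201C": '"',    # LEFT DOUBLE QUOTATION MARK
    "\u201D": '"',    # RIGHT DOUBLE QUOTATION MARK
    "\u2026": "...",  # HORIZONTAL ELLIPSIS
    "\u2014": "-",    # EM DASH
})

def fold_then_convert(utf8_bytes: bytes, codec: str = "cp037") -> bytes:
    """Decode UTF-8, replace typographic characters, then encode to EBCDIC."""
    return utf8_bytes.decode("utf-8").translate(FOLD).encode(codec)

sample = "O\u2019Brien".encode("utf-8")   # hypothetical sample value
ebcdic = fold_then_convert(sample)
print(ebcdic.decode("cp037"))
```

This keeps a readable apostrophe in the name, at the cost of losing the distinction between the typographic and the plain character.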

Cheers
Vern

On 5/5/2020 1:02 AM, Thomas Raddatz wrote:

I do not know what you mean by 'right single quotation mark'. I
assume it is an ACUTE ACCENT or a GRAVE ACCENT according to the UTF-8
table at https://www.utf8-chartable.de/.

I did a brief test with service program CSVR4 and the following test
data on our IBM i:

"ABC123","Scott Klement","123 Sesame St","Milwaukee, WI","USA","","53132-1234",1000.00
"ABC123","Bärbel Böhm","Some Street","Some City","Germany","","40721",1000.00
"ABC123","`Jürgen` ´Bärbeißer´","Some Street","Some City","Germany","","40721",1000.00

The report produced by CSVDEMO shows the expected result:

File . . . . . :   QSYSPRT
Control . . . . .
Find . . . . . .
*...+....1....+....2....+....3...
Acct        Name
---------- ---------------------
ABC123      Scott Klement
ABC123      Bärbel Böhm
ABC123      `Jürgen` ´Bärbeißer´

The German umlauts as well as the ACUTE ACCENT and GRAVE ACCENT are
printed correctly. Hence I assume that CSVR4 works fine.
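
A possible reason this test passes while the right single quotation mark does not: every character in the test data has a slot in a single-byte EBCDIC code page, and U+2019 has none. The Python sketch below checks that, assuming `cp273` (German EBCDIC) as a stand-in for the CCSID on the test system, which the thread does not state.

```python
import unicodedata

names = ["Scott Klement", "Bärbel Böhm", "`Jürgen` ´Bärbeißer´"]

for name in names:
    # cp273 is Python's codec for the German EBCDIC code page.
    assert name.encode("cp273").decode("cp273") == name

# Compare the two accents from the test data with U+2019.
for ch in "`\u00b4\u2019":
    try:
        ch.encode("cp273")
        status = "in cp273"
    except UnicodeEncodeError:
        status = "not in cp273"
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {status}")
```

The accents are different code points from the quotation mark, so this test does not exercise the failing case.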

We do not use CSVR4, so a brief test is all I can do.

Did you check the CCSID of your input? Is it 1208 (= UTF-8)?

Thomas.

-----Original Message-----
From: RPG400-L [mailto:rpg400-l-bounces@xxxxxxxxxxxxxxxxxx] On
Behalf Of Greg Wilburn
Sent: Monday, May 4, 2020 16:24
To: RPG400-L@xxxxxxxxxxxxxxxxxx
Subject: CSVR4 and UTF-8

I have a program that uses the CSVR4 service program to read
tab-delimited text files that we pull down from a website. The site
uses the UTF-8 character set... occasionally, we have issues with
character translation.

Example: x'e2 80 99' (right single quotation mark) makes a real
mess of the customer's name.

I have a utility that removes non-display characters, but in this
case I need to keep the character.

Any recommendations on changes that could be made to the process
that would eliminate some of these translation issues?

Thanks,
Greg
