Re: CSVR4 and UTF-8 -- RPG400-L

Hi Scott,

Thank you for the update. Of course %str() is useless with UCS-2, because a whole bunch of characters start with x'00'. But scanning for x'0000' does not work either, does it?

Please correct me if I am wrong, but from what I can see, there is no x'0000' in my second example, where I used a CSV file with line delimiter LF:

00090     002C0031 00300030 0030002E 00300030   - ................
000A0     000A0040 40404040 40404040 40404040   - ...
          ---- why is LF = x'000A'?
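
The "no x'0000'" observation can be checked mechanically. This is only a minimal Python sketch for inspecting the bytes (the code under discussion is RPG calling the C fgets()); the byte string is copied from the two dump lines above, and the names `tail`, `lf_pos` and `after_lf` are made up for this example:

```python
# Last two dump lines of the LF example, copied from offsets 0x90 and 0xA0.
tail = bytes.fromhex(
    "002C0031" "00300030" "0030002E" "00300030"   # offset 0x90
    "000A0040" "40404040" "40404040" "40404040"   # offset 0xA0
)

# Is there a two-byte null anywhere, aligned or not?
has_double_null = b"\x00\x00" in tail

# Where does the LF code unit x'000A' sit, and which two bytes follow it?
lf_pos = tail.find(b"\x00\x0a")
after_lf = tail[lf_pos + 2 : lf_pos + 4]

print(has_double_null)        # False
print(hex(0x90 + lf_pos))     # 0xa0
print(after_lf.hex())         # 0040
```

So in this dump x'000A' is followed by one x'00' and then the x'40' fill, and there is no x'0000' pair to scan for. Whether that single x'00' really is the terminator written by fgets() is an assumption, not something the dump proves.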

All I can see is x'000A00', which could be the LF followed by another x'00'. I assume that this x'00' is the nul terminator added by fgets(). So scanning for x'000A00' could work, if there were not my first example with a CSV file with line delimiter CRLF. In that case the line ended with:

00090     002C0031 00300030 0030002E 00300030   - ................
000A0     00000A00 40404040 40404040 40404040   - ....
          -- where does this byte belong to???
             and why is LF = x'000A'?

In this example there is an unexpected x'00' before x'000A00', so we would get an extra x'00' byte if we scanned for x'000A00'. Looks like a bug in fgets().

I did not yet look at a file with line delimiter CR.

What am I missing?

Regards,

Thomas.

PS: I do not use your CSVR4 tool, but I agree that UCS-2 support would be great.

-----Original Message-----
From: RPG400-L [mailto:rpg400-l-bounces@xxxxxxxxxxxxxxxxxx] On Behalf Of Scott Klement
Sent: Tuesday, May 5, 2020 21:05
To: rpg400-l@xxxxxxxxxxxxxxxxxx
Subject: Re: CSVR4 and UTF-8

Thomas,

If your goal is to have fgets() return data in UCS-2, then you would not
be able to use %STR, because %STR is not expecting UCS-2.

Instead, you could manually scan the buffer (via %SCAN) for the x'0000'
terminator, and then use something like memcpy or %SUBST (with a deref
pointer to UCS2 field) to copy the appropriate length into an RPG
VARUCS2 field.
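
The scan-and-copy idea could look roughly like the following. It is a Python sketch rather than RPG, and it assumes what the paragraph above assumes: that the buffer ends in a two-byte x'0000' terminator sitting on a character boundary. The function name `ucs2_payload` and the sample buffer are invented for illustration.

```python
def ucs2_payload(buf: bytes) -> bytes:
    """Return the bytes before the first x'0000' found on a 2-byte boundary.

    Raises ValueError if no aligned terminator exists, instead of guessing.
    """
    for pos in range(0, len(buf) - 1, 2):      # step 2: stay on character boundaries
        if buf[pos:pos + 2] == b"\x00\x00":
            return buf[:pos]
    raise ValueError("no aligned x'0000' terminator in buffer")


# Hypothetical buffer: "A,1" + LF in UCS-2, a x'0000' terminator, x'40' fill.
sample = "A,1\n".encode("utf-16-be") + b"\x00\x00" + b"\x40" * 6

payload = ucs2_payload(sample)
print(len(payload))                       # 8
print(repr(payload.decode("utf-16-be")))  # 'A,1\n'
```

Stepping by two matters: a plain byte-wise search for x'0000' can also match the low byte of one character plus the high byte of the next (for example x'0100' followed by x'0041'). If the buffer only carries a one-byte x'00' terminator, this function raises instead of returning a wrong length.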

The OP would still run into the same problem when converting the UCS-2
field to EBCDIC, but he'd be able to do scan/replace prior to that in
his own code.

Sounds like he might be using my CSVR4 tool, so maybe I should add UCS-2
support...

-SK

On 5/5/2020 1:22 PM, Tools/400 wrote:

Hi Vern,

One thing that could actually work is reading the data in CCSID UCS-2
(13488). That covers almost all characters (or maybe even all
characters) of UTF-8. Then Greg could replace e.g. the SINGLE
QUOTATION MARKS by a character that can be translated to EBCDIC, for
example a SINGLE QUOTE ('). He could do the same for other characters
that have an almost equal counterpart in EBCDIC.

Once that is done, he could translate the UCS-2 string to EBCDIC.
Basically that worked for me (quick and dirty test). But there is a
problem I do not understand. The key is that %str() does not work for
determining the length of the buffer returned by fgets().
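
The replace-then-translate idea described above can be sketched like this in Python (not RPG). The replacement table and the function name `to_ebcdic` are assumptions made for the example. Python's `cp273` codec (German EBCDIC) is used only as a stand-in for the target code page, because the thread does not state the actual job CCSID.

```python
# Assumed mapping for illustration; extend it with whatever the data needs.
REPLACEMENTS = {
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
}


def to_ebcdic(ucs2_bytes: bytes, codec: str = "cp273") -> bytes:
    """Decode big-endian UCS-2, swap unmappable characters, encode as EBCDIC."""
    text = ucs2_bytes.decode("utf-16-be")
    text = text.translate(str.maketrans(REPLACEMENTS))
    return text.encode(codec)   # raises UnicodeEncodeError if something is still unmappable


# The name field as it appears in the UCS-2 dumps: x'2019' B ä r b e l x'2019'
field = bytes.fromhex("2019" "0042" "00E4" "0072" "0062" "0065" "006C" "2019")

ebcdic = to_ebcdic(field)
print(ebcdic.decode("cp273"))   # 'Bärbel'
```

Letting the encode step raise on anything still unmappable shows which characters are missing from the table, rather than silently producing substitution bytes.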

Here are some examples returned by fgets():

EOL character: CRLF, o_CCSID=13488
00000     FFDE0022 00410042 00430031 00320033   - █ú... .â.{......
00010     0022002C 00222019 004200E4 00720062   - .........â.U.Ê.Â
00020     0065006C 20190020 004200F6 0068006D   - .Á.%.....â.6.Ç._
00030     0022002C 00220053 006F006D 00650020   - .......ë.?._.Á..
00040     00530074 00720065 00650074 0022002C   - .ë.È.Ê.Á.Á.È....
00050     00220053 006F006D 00650020 00430069   - ...ë.?._.Á...{.Ñ
00060     00740079 0022002C 00220047 00650072   - .È.`.......å.Á.Ê
00070     006D0061 006E0079 0022002C 00220022   - ._./.>.`........
00080     002C0022 00340030 00370032 00310022   - ................
00090     002C0031 00300030 0030002E 00300030   - ................
000A0     00000A00 40404040 40404040 40404040   - ....
          -- where does this byte belong to???
             and why is LF = x'000A'?

EOL character: LF, o_CCSID=13488
00000     FFDE0022 00410042 00430031 00320033   - █ú... .â.{......
00010     0022002C 00222019 004200E4 00720062   - .........â.U.Ê.Â
00020     0065006C 20190020 004200F6 0068006D   - .Á.%.....â.6.Ç._
00030     0022002C 00220053 006F006D 00650020   - .......ë.?._.Á..
00040     00530074 00720065 00650074 0022002C   - .ë.È.Ê.Á.Á.È....
00050     00220053 006F006D 00650020 00430069   - ...ë.?._.Á...{.Ñ
00060     00740079 0022002C 00220047 00650072   - .È.`.......å.Á.Ê
00070     006D0061 006E0079 0022002C 00220022   - ._./.>.`........
00080     002C0022 00340030 00370032 00310022   - ................
00090     002C0031 00300030 0030002E 00300030   - ................
000A0     000A0040 40404040 40404040 40404040   - ...
          ---- why is LF = x'000A'?

EOL character: CRLF, o_CCSID=0
00000     FFDE7FC1 C2C3F1F2 F37F6B7F 3FC2C099   - █ú"ABC123",".Bär
00010     8285933F 40C26A88 947F6B7F E2969485   - bel. Böhm","Some
00020     40E2A399 8585A37F 6B7FE296 948540C3   - Street","Some C
00030     89A3A87F 6B7FC785 99948195 A87F6B7F   - ity","Germany","
00040     7F6B7FF4 F0F7F2F1 7F6BF1F0 F0F04BF0   - ","40721",1000.0
00050     F0250040 40404040 40404040 40404040   - 0..
            -- that is fine. LF = x'25' in EBCDIC

I do not understand the last bytes of example 1 and 2. CRLF in UCS-2
should be x'240D' (CR) and x'240A' (LF), shouldn't it?
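
On the code points themselves: in Unicode, CR is U+000D and LF is U+000A, so in big-endian UCS-2 they come out as x'000D' and x'000A'. U+240D and U+240A are the printable "symbol for" control pictures, not the control characters. A short Python check, using only the standard library:

```python
import unicodedata

# CR and LF encoded as big-endian 16-bit code units.
crlf = "\r\n".encode("utf-16-be")
print(crlf.hex())                      # 000d000a

# What U+240D and U+240A actually are.
print(unicodedata.name("\u240d"))      # SYMBOL FOR CARRIAGE RETURN
print(unicodedata.name("\u240a"))      # SYMBOL FOR LINE FEED
```

That accounts for the x'000A' in examples 1 and 2. It does not explain the stray x'00' in front of it in the CRLF case.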

How could we get the length of the buffer returned by fgets()?

Thomas.
