I think it's a reasonable approach - your very own UTF-8 parser.

I was looking at a UTF-8 table - the two leftmost (high-order) bits of the 1st byte of a multibyte sequence are 11.

It looks like every subsequent byte, whether there are 1, 2, or 3 of them, has 10 in those bits.

You can also tell the number of bytes in the sequence from the leading bits of the 1st byte -

110 is 2 bytes
1110 is 3 bytes
11110 is 4 bytes
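
A minimal sketch of the lead-byte table above, written in Python purely for illustration (the thread's own program is RPG and is not shown). The function names `utf8_sequence_length` and `is_continuation` are made up for this example.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the sequence length implied by a UTF-8 lead byte.

    Returns 0 when the byte cannot start a sequence (a continuation
    byte of the form 10xxxxxx, or a byte that is not valid in UTF-8).
    """
    if first_byte & 0b1000_0000 == 0b0000_0000:  # 0xxxxxxx: single byte
        return 1
    if first_byte & 0b1110_0000 == 0b1100_0000:  # 110xxxxx: 2 bytes
        return 2
    if first_byte & 0b1111_0000 == 0b1110_0000:  # 1110xxxx: 3 bytes
        return 3
    if first_byte & 0b1111_1000 == 0b1111_0000:  # 11110xxx: 4 bytes
        return 4
    return 0


def is_continuation(byte: int) -> bool:
    """True when the two high bits are 10, i.e. a follow-on byte."""
    return byte & 0b1100_0000 == 0b1000_0000
```

Knowing the length from the 1st byte would let a scanner skip over the whole sequence at once instead of testing each follow-on byte.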

I got this from the UTF-8 wiki at https://en.wikipedia.org/wiki/UTF-8

Maybe that can help optimize a little or maybe not, don't know - it might be a smidge more complicated to replace 4 bytes with 1 blank.

Anyhow, continued good luck with this.

Vern

On 5/18/2020 6:40 PM, smith5646midrange@xxxxxxxxx wrote:
I just wanted to take a minute to follow up and provide the end product
details.

I know many did not agree with this but the client's design was to blank the
entire field if it contained any multibyte characters. There are three
fields that might contain multibyte UTF-8 characters.

The file in the IFS remains 819. It is a .csv file so I'm using CPYFRMIMPF
to load the DB2 file. The three fields that might contain multibyte UTF-8
characters were defined as CCSID 65535 in the DB2 file. The program examines
each byte to see if the two high bits are on. This identifies the byte
as the first byte of a multibyte character. If a multibyte
character is found, I load blanks. If a multibyte character is not found, I
convert the field from 819 to 37 and put the converted value in the file.
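
The logic described above can be sketched as follows. This is Python for illustration only, not the actual program. It assumes Python's `latin-1` and `cp037` codecs as stand-ins for CCSID 819 and CCSID 37, and assumes that "blanks" means EBCDIC spaces (x'40'). The names `has_multibyte_lead` and `load_field` are hypothetical.

```python
EBCDIC_BLANK = 0x40  # space in CCSID 37 (assumed meaning of "blanks")


def has_multibyte_lead(field: bytes) -> bool:
    """True if any byte has both high bits on (11xxxxxx)."""
    return any(b & 0b1100_0000 == 0b1100_0000 for b in field)


def load_field(field: bytes) -> bytes:
    """Blank the whole field if it holds a multibyte character,
    otherwise convert it from 819 (latin-1) to 37 (cp037)."""
    if has_multibyte_lead(field):
        return bytes([EBCDIC_BLANK]) * len(field)
    return field.decode("latin-1").encode("cp037")
```

One caveat with this test: in data that really is 819, bytes x'C0' through x'FF' are accented Latin-1 letters, so they would be flagged too. The sketch assumes each field is either plain ASCII or UTF-8.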

My only question now is if someone knows something about UTF-8 multibyte
characters that is different from what I stated above with the high two
bits. Every page I found on the internet stated this was how they are
identified.

Thanks everyone for all of your help.

-----Original Message-----
From: MIDRANGE-L <midrange-l-bounces@xxxxxxxxxxxxxxxxxx> On Behalf Of Vernon
Hamberg
Sent: Monday, May 18, 2020 1:50 PM
To: Midrange Systems Technical Discussion <midrange-l@xxxxxxxxxxxxxxxxxx>
Subject: Re: Chinese characters in an interface file

Is the FTP transfer initiated from the client? The client can use this
subcommand -

quote type c 1208

I just tried that, transferred a simple text file, and it was created with
1208 ccsid.

If the client can't do that, you can either, as others have suggested (look
at Barbara Morris' post or posts), use CHGATR on the IFS file to make it
1208, or maybe use the CPY command to copy the original, specifying
1208 on the TO... parameter - the TOSTMF must not exist.

Again, look at some other posts, esp. Barbara's, I recall she responded to
my thoughts, too - she writes the RPG compiler.

HTH
Vern

On 5/18/2020 11:03 AM, smith5646midrange@xxxxxxxxx wrote:
I don't know how to create the file on the IFS with CCSID when the
file is created via the FTP command. Is there something I don't know
about?
There is a TO CCSID but I didn't find a FROM CCSID.

-----Original Message-----
From: MIDRANGE-L <midrange-l-bounces@xxxxxxxxxxxxxxxxxx> On Behalf Of
Vernon Hamberg
Sent: Monday, May 18, 2020 11:47 AM
To: Midrange Systems Technical Discussion
<midrange-l@xxxxxxxxxxxxxxxxxx>
Subject: Re: Chinese characters in an interface file

You probably need to tag the IFS file as 1208 instead of 819 - or CPY
it binary to an IFS file so tagged.

Or is there a setting on CPYFRMIMPF for CCSID, from and to? Used to be
one for code page, but you should not use that, IIRC.

Vern

On 5/18/2020 10:14 AM, smith5646midrange@xxxxxxxxx wrote:
I must be missing a step because I did not get what I thought that I
would.
The file on our IFS is CCSID 819.
I created a DB2 file that is CCSID 37 except for the two fields that
are problems and they are defined as 1208.
I did a CPYFRMIMPF and did not change any of the default values
except record delimiter.
I created a test RPGLE program that read the DB2 file and then moved
one of the 1208 fields to a work field in the program.
The value in the work field is identical to the 1208 field and does
not contain hex 3F chars.

What did I miss?

-----Original Message-----
From: MIDRANGE-L <midrange-l-bounces@xxxxxxxxxxxxxxxxxx> On Behalf Of
Barbara Morris
Sent: Wednesday, May 13, 2020 12:14 PM
To: midrange-l@xxxxxxxxxxxxxxxxxx
Subject: Re: Chinese characters in an interface file

On 2020-05-13 11:22 a.m., Vernon Hamberg wrote:
Hi

It is possible that the CSV file is in UTF-8 -- the hex 31 you
describe is what the number 1 would be in UTF-8.
...

If it IS UTF-8, you might try marking the file with CCSID 1208 and
do a text transfer, not a binary transfer.

Or mark the field in your PF as 1208 CCSID, do the binary transfer,
then see what RPG does with it. In RPG, do an EVAL from the UTF-8
field to a regular 37 EBCDIC field.
...
I agree with Vern that it is likely that you have UTF-8 data.

If you get the data into a UTF-8 RPG field and assign it to a CCSID
37 field using RPG, the characters that can't convert to CCSID(37) will be
x'3F'.
You could then scan for x'3F' and blank the field if you find one.

Or, if you're specifically only concerned with Chinese, you could
assign to a CCSID(937) field, and then scan for x'0E' to see if there
are any Chinese characters.

But ... following the Never Lose Data principle, I think it would be
a zillion times better to save the UTF-8 data as is and not try to
convert it to CCSID 37, and especially not blank the field if it has
Chinese data.
Just guessing, but assuming you do need this in CCSID 37, wouldn't it
be better to just replace the unconvertible characters with say '?'
than to blank out the entire field?
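
A rough sketch of that last idea, in Python for illustration only (on IBM i the conversion would be done by RPG or the database, which substitutes x'3F' as described above). It assumes Python's `cp037` codec as a stand-in for CCSID 37, and the function name `utf8_to_ccsid37` is made up for this example.

```python
def utf8_to_ccsid37(field: bytes, substitute: str = "?") -> bytes:
    """Convert UTF-8 bytes to CCSID 37 (cp037), replacing each
    character that has no CCSID 37 equivalent with `substitute`."""
    out = bytearray()
    for ch in field.decode("utf-8"):
        try:
            out += ch.encode("cp037")
        except UnicodeEncodeError:
            # No mapping in cp037: keep the rest of the field and
            # mark only this one character.
            out += substitute.encode("cp037")
    return bytes(out)
```

Each unconvertible character becomes one '?' regardless of how many UTF-8 bytes it took, so the rest of the field survives.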

--
Barbara

--
This is the Midrange Systems Technical Discussion (MIDRANGE-L)
mailing list To post a message email: MIDRANGE-L@xxxxxxxxxxxxxxxxxx
To subscribe, unsubscribe, or change list options,
visit: https://lists.midrange.com/mailman/listinfo/midrange-l
or email: MIDRANGE-L-request@xxxxxxxxxxxxxxxxxx
Before posting, please take a moment to review the archives at
https://archive.midrange.com/midrange-l.

Please contact support@xxxxxxxxxxxxxxxxxxxx for any subscription
related questions.

Help support midrange.com by shopping at amazon.com with our
affiliate
link:
https://amazon.midrange.com





This mailing list archive is Copyright 1997-2024 by midrange.com and David Gibbs as a compilation work. Use of the archive is restricted to research of a business or technical nature. Any other uses are prohibited. Full details are available on our policy page. If you have questions about this, please contact [javascript protected email address].
