Chuck, I appreciate the full detail. We are reviewing our options. We
take in data from many sources. Some of the imports and exports have run
15+ years with little change. What has changed is the front end, both
ours and our trading partners'. This will take a little time to digest
and implement a change.
Jim

On Mon, May 11, 2015 at 12:38 PM, CRPence <crpbottle@xxxxxxxxx> wrote:

> On 08-May-2015 13:00 -0500, Jim Franz wrote:
>
>> We do a lot of import and export of data, plus have both PC client
>> (local and web) input as well as PC5250. Had a recent thread involving
>> cut-and-paste data (EBCDIC x'3F') that caused an issue. We use CCSID
>> 37 and ASCII 819.
>>
>
> If data comes from the web [or other ASCII sources], the best choice is
> likely to receive and store the data as UTF-8, because that is probably
> what the data coming from the web\elsewhere already is. One might
> expect that the stored EBCDIC data is already being converted to UTF-8
> for any transmittal, so storing it as UTF-8 would eliminate that
> conversion activity [from EBCDIC].
>
>> There are more EBCDIC characters than what we see on the US keyboard.
>> Some we need, such as the copyright symbol, cents sign, etc., but many
>>
>
> "but many..." ¿other characters are never utilized, and thus we don't
> really care about or /need/ to store\transmit such characters?
>
> If the EBCDIC characters are transmitted as UTF-8, then if they are
> used they would be recognized; there would be little reason to limit
> the EBCDIC characters for that reason. If that EBCDIC data is
> transferred as EBCDIC across multiple EBCDIC CCSIDs, then choosing to
> minimize the number of variant characters would reduce the chances for
> the type of /data loss/ whereby the code points no longer match the
> expected characters. As long as the stored data has either a visible
> glyph or an intended effect [such as tab, CRLF, etc.]
> that enables someone to see\review the data, then there is little
> reason to be concerned with how [un]popular or how [un]needed each
> character is, except perhaps for compression for storage purposes, if
> the data is only being stored and\or is sent out as UTF-8.
>
>> We are wanting to take steps to clean the data on input, whether from
>> the ASCII or EBCDIC side. We have some input already cleansed, but
>> only at the screen-program level.
>>
>
> IIRC the 0x3F issue originally discussed was the result of copying from
> web page data [or other ASCII document source] that was later pasted
> onto a 5250 screen. In that case I believe the 5250 program could scan
> for those SUB characters to address the issue directly, because they
> got there before the data was ever sent to the database.
>
> FWiW, as some possible added value, here is a link to my reply
> [probably see the entire thread] in a topic a couple of years ago with
> "Subject: Can you select replacement character in translation from
> UNICODE (UTF-16)"
> <http://archive.midrange.com/midrange-l/201304/msg00985.html>
> and another I had replied to twice in the following discussion [in
> Italian, so reviewing the other messages might require an online
> translation] with "Subject: FTP + STMF + CCSID"
> <http://it.comp.as400.narkive.com/7nGnQ43I/ftp-stmf-ccsid>
>
>> Couple questions:
>>
>> 1. Just replacing all below EBCDIC x'40' leaves a lot of strange
>> characters like x'8C' (sort of a moon with a hat..).
>>
>
> That x'8C' is the character Id LD630000 "eth Icelandic Small";
> • to see the proper glyph with that character identifier\name with the
> code point 0x8C in Code Page 37:
> <ftp://ftp.software.ibm.com/software/globalization/gcoc/attachments/CP00037.pdf>
> • to see the character identifier (GCGID) with descriptive text (GCGID
> Name) in a text-only file [lacking a glyph] collated according to CP 37:
> <ftp://ftp.software.ibm.com/software/globalization/gcoc/attachments/CP00037.txt>
>
>> One thought is to identify all the characters we need and replace the
>> rest. No need to keep line and page formatting stuff. Is this a good
>> idea?
>>
>
> If data-loss is not a concern, then /good enough/ I suppose; it is
> quite difficult for anyone from /outside/ to make that decision [for
> you], and any advice given from outside would be quite blind and
> perhaps even quite inappropriate for not knowing the application and
> business requirements... but /data loss/ is, in my experience, rarely
> considered acceptable in a business application, given that data-loss
> issues can lead to anger and frustration and possibly even to
> client-loss. As for anything translated to EBCDIC control characters
> [e.g. "line and page formatting stuff"], translating to blanks
> anything less than a blank is probably not too harmful, but possibly
> frustrating to anyone unable to see the original\intended formatting,
> esp. if that /same data/ was later resent as ASCII data [as converted
> from the EBCDIC data with all of the control characters since replaced
> with just spaces]; e.g. tabbed data can look spectacularly bad when
> the tabs are lost or replaced with non-tab characters.
>
> But I wonder, does that statement ["identify and replace"] apply to
> the ASCII data coming in?
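FWiW, that x'8C' mapping is easy to confirm mechanically; here is a small sketch in Python [not from the original thread; Python's "cp037" codec stands in for EBCDIC CCSID\Code Page 37 here] decoding the single byte to its Unicode character:

```python
# Decode the single EBCDIC (Code Page 37) byte x'8C' to Unicode.
# Python's "cp037" codec implements IBM Code Page 37.
eth = bytes([0x8C]).decode("cp037")
print(eth)            # the "eth Icelandic Small" letter
print(hex(ord(eth)))  # 0xf0, i.e. U+00F0 LATIN SMALL LETTER ETH
```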
> And if so, does that perhaps already effectively happen, per the
> substitution character replacing those characters for which there is
> no EBCDIC character representation, when that ASCII data is stored as
> EBCDIC; if not per the screen I\O from the paste of copied data, then
> upon [implicit] conversion as input to the database file? Wondering,
> because most [if not all] characters that are valid EBCDIC in the
> CCSID 37 being utilized will be valid going into ASCII for any of 819,
> 1252, or UTF-8.
>
> FWiW: The following SQL expression replaces the x'3F' with the empty
> string, and the next expression replaces it with an underscore
> character [it could just as well be a space character, but a visual
> glyph left in place can be more revealing, and is considered a
> /significant/ character even at the end of a string], for which all
> other character positions remain unaltered; so, for example, the
> string length remains the same [which is not the case for replace by
> an empty string, and may not be the case for replace by a blank *if*
> the last character is replaced]:
>
>   replace( :the_string, x'3F', '' )
>
>   replace( :the_string, x'3F', '_' )
>
> FWiW: Another SQL expression that might be used to replace a known
> list of characters that are undesirable to be stored [including the
> substitution (SUB) character 0x3F]; see also "Subject: Strip list of
> characters from a field using SQL"
> <http://archive.midrange.com/rpg400-l/201208/msg00160.html>:
>
>   replace( translate( :the_string, x'3F'
>                     , 'listofunwantedcharacters'
>                     , x'3F' ) , x'3F', '' )
>
>> 2. Thinking that since [there are] a multitude of entry/update
>> points, db triggers are best? Am wondering about apps that write the
>> data, and now after write, the screen column data is different than
>> column data in file (trigger pgm cleaned the data) - hoping to avoid
>> opening up all the apps.
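FWiW, the strip-a-list-of-characters logic in the SQL expressions above can be mirrored outside SQL; here is a small Python sketch [the function names and sample strings are illustrative, not from the thread; U+001A is what the EBCDIC SUB x'3F' becomes once decoded from CCSID 37] using str.translate in place of the TRANSLATE\REPLACE pair:

```python
SUB = "\x1a"  # EBCDIC SUB x'3F' maps to U+001A when decoded from cp037

def strip_unwanted(s: str, unwanted: str) -> str:
    # Mirror of: replace(translate(s, x'3F', unwanted, x'3F'), x'3F', '')
    # i.e. delete every unwanted character, and SUB itself.
    return s.translate({ord(c): None for c in unwanted + SUB})

def mark_unwanted(s: str, unwanted: str, mark: str = "_") -> str:
    # Variant mirroring replace(..., '_'): the string length is kept,
    # and the visible glyph reveals where a character was replaced.
    return s.translate({ord(c): mark for c in unwanted + SUB})

print(strip_unwanted("ab\x1acd", ""))  # abcd
print(mark_unwanted("ab\x1acd", ""))   # ab_cd
```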
>>
>
> Best overall might be choosing a more expansive data type\CCSID to be
> more accommodating as the recipient into which the data is written;
> i.e. change the database to accept UTF-8 [or UCS-2 or UTF-16] where
> SBCS EBCDIC is lacking. However, if what might be _best_ is an attempt
> to avoid making changes to the underlying database or a specific TABLE
> [changes that are externally visible to the applications], then the
> TRIGGER seems an appropriate choice.
>
> A CHECK CONSTRAINT, or an error manifest from the TRIGGER for that
> type of I\O, instead of implicit /cleansing/, might be more
> appropriate to ameliorate the concern that "the screen column data is
> different than column data". But then addressing every application
> that might encounter those errors, to ensure that in response all
> programs would invoke the required procedure to identify and\or
> correct the data in a manner compatible with the application, is
> probably going to be just as worrisome as the database data
> mismatching the screen data.
>
> One might argue that the triggers could send up a non-terminating
> message [e.g. an informational message; I am unsure of any capability
> to send either of Notify or Status, but those are both Exception-class
> messages] to notify of the data change, but the applications would all
> have to know to /look for/ and to handle that notification, such as to
> redisplay the updated data to confirm that what is displayed [in
> contrast to what was originally entered] serves as an acceptable
> alternative to the data originally attempted to be sent to the
> database. But triggers really need to be independent of the
> application and just do their thing; there are quite different
> requirements to try to re-display a web page vs a 5250 screen in
> response to a trigger that deems the data requires a prompted review
> to confirm.
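FWiW, the reject-rather-than-cleanse approach amounts to a simple validation at each write; here is a Python sketch [the allowed repertoire shown is purely illustrative, not from the thread] of what a CHECK CONSTRAINT would effectively enforce, leaving the fix-and-retry to the application instead of a trigger silently altering the row:

```python
import string

# Illustrative allowed repertoire; a real list would come from the
# business requirements, not from this sketch.
ALLOWED = set(string.ascii_letters + string.digits +
              string.punctuation + " ")

def is_clean(s: str) -> bool:
    # CHECK CONSTRAINT-style test: True only when every character is in
    # the allowed set, so the stored data always matches what was sent.
    return all(c in ALLOWED for c in s)

print(is_clean("Main St. #4"))  # True
print(is_clean("bad\x1adata"))  # False: contains a SUB character
```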
>
> A better idea, in combination with a trigger that modifies the inputs,
> is to change the applications to use an effective SELECT FROM INSERT
> implementation [whether implemented as SQL or native] to compare the
> data from the database with what was in the buffer as what was
> intended to be written. That would be more appropriate as a change to
> the application than the application having any expectation of what
> the trigger should communicate back to it *other than* an I\O error;
> i.e. again, the trigger should not try to communicate with the
> programs that might support disparate interfaces to an application.
>
>> 3. How far do people with heavy edi take this? Am I leaving something
>> out with the keyboard characters plus a few more? These are names,
>> addresses, notes (which are sometimes pages of notes).
>>
>
> I figured most transactions would have been transmitted in UTF-8,
> especially for any free-form data [such as /notes/ fields], and thus
> best stored as UTF-8 if such data were to be stored; names and
> addresses are nearly free-form, per each being likely to contain
> /words/ that are less likely able to be represented in only the
> available EBCDIC characters. The much more structured data elements
> often are constrained by what characters can appear, e.g. numeric-only
> with limited formatting, or codes limited to EUC and digits, so those
> are more likely to be easily stored as EBCDIC [without any data-loss
> issues], because there is going to be a one-to-one correlation for
> each character between EBCDIC and ASCII [aka having a valid round-trip
> for each character conversion].
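FWiW, that one-to-one\round-trip claim can be verified mechanically; here is a Python sketch [not from the thread; the "cp037" and "latin-1" codecs stand in for CCSID 37 and ASCII 819] checking that every CCSID 37 code point survives a trip through 819 and back unchanged:

```python
# Verify a valid round-trip: every EBCDIC Code Page 37 byte decodes to
# a character that 819 (ISO 8859-1, "latin-1") can also represent, and
# converting back yields the original byte -- i.e. no data loss.
round_trips = all(
    bytes([b]).decode("cp037").encode("latin-1")
              .decode("latin-1").encode("cp037") == bytes([b])
    for b in range(256)
)
print(round_trips)  # True
```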
>
> --
> Regards, Chuck
>
> --
> This is the Midrange Systems Technical Discussion (MIDRANGE-L) mailing
> list.
> To post a message, email: MIDRANGE-L@xxxxxxxxxxxx
> To subscribe, unsubscribe, or change list options,
> visit: http://lists.midrange.com/mailman/listinfo/midrange-l
> or email: MIDRANGE-L-request@xxxxxxxxxxxx
> Before posting, please take a moment to review the archives
> at http://archive.midrange.com/midrange-l.