On 08-May-2015 13:00 -0500, Jim Franz wrote:
We do a lot of import and export of data, plus have both PC client
(local and web) input as well at PC5250. Had a recent thread
involving cut and paste data (ebcdic x'3F') that caused an issue. We
use CCSID 37 and ascii 819.

If data comes from the web [or other ASCII sources], the best choice is likely to receive and store the data as UTF-8, because that is probably the form in which the data arrives from the web or elsewhere. One might expect that the stored EBCDIC data is already being converted to UTF-8 for any transmittal, so storing it as UTF-8 would eliminate that conversion activity [from EBCDIC].
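To make that conversion activity concrete, here is a small sketch [not from the original post] using Python's standard codecs; "cp037" is Python's name for EBCDIC code page 37, and the record text is invented for illustration:

```python
# Model the CCSID 37 <-> UTF-8 conversions described above.
# "cp037" is Python's stdlib name for EBCDIC code page 37.
ebcdic_record = "Invoice total: 100¢".encode("cp037")   # data as stored in CCSID 37

# Every transmittal of EBCDIC-stored data needs this conversion step:
utf8_payload = ebcdic_record.decode("cp037").encode("utf-8")

# Data stored as UTF-8 from the start could be sent as-is, skipping that step.
print(utf8_payload.decode("utf-8"))
```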


There are more EBCDIC characters than what we see on the US Keyboard.
Some we need, such as copyright symbol, cents sign, etc, but many

"but many..." ¿other characters are never utilized and thus we don't really care about or /need/ to store\transmit such characters?

If the EBCDIC characters are transmitted as UTF-8, then any that are used will be recognized; there is little reason to limit the EBCDIC characters on that account. If instead that EBCDIC data is transferred as EBCDIC across multiple EBCDIC CCSIDs, then choosing to minimize the number of variant characters reduces the chance of the type of /data loss/ whereby the code points no longer match the expected characters. As long as the stored data has either a visible glyph or an intended effect [such as tab, CRLF, etc.] that enables someone to see\review the data, there is little reason to be concerned with how [un]popular or how [un]needed each character is, except perhaps for compression or storage purposes, if the data is only being stored and\or sent out as UTF-8.
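The variant-character mismatch can be shown with a classic example [illustrative, not from the post]: byte x'4A' is the cent sign in CCSID 37 but '[' in CCSID 500, so raw EBCDIC bytes moved between those CCSIDs silently change meaning:

```python
# One byte, two meanings across EBCDIC CCSIDs -- the /data loss/ described above.
raw = b"\x4a"
as_ccsid_37  = raw.decode("cp037")   # cent sign
as_ccsid_500 = raw.decode("cp500")   # left square bracket
print(as_ccsid_37, as_ccsid_500)
```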


We are wanting to take steps to clean the data on input, whether from
ascii or ebcdic side. We have some input already cleansed, but only
at screen program level.

IIRC the 0x3F issue originally discussed was the result of copying from web page data [or other ASCII document source] that was later pasted onto a 5250 screen. In that case I believe the 5250 program could scan for those SUB characters to address the issue directly, because the characters got there before the data was ever sent to the database?
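Such a scan is simple; here is a minimal sketch [the helper name is hypothetical] that locates every EBCDIC SUB character x'3F' in a buffer before it reaches the database:

```python
# Find every EBCDIC SUB character x'3F' in a record buffer.
def find_sub_positions(ebcdic_bytes: bytes) -> list[int]:
    SUB = 0x3F
    return [i for i, b in enumerate(ebcdic_bytes) if b == SUB]
```

For example, `find_sub_positions(b"\xc1\x3f\xc2")` reports position 1, and a clean buffer reports nothing.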

FWiW as some possible added value, here is a link to my reply [probably see entire thread] in a topic a couple years ago with "Subject: Can you select replacement character in translation from UNICODE (UTF-16)" <http://archive.midrange.com/midrange-l/201304/msg00985.html> and another I had replied to twice in the following discussion [in Italian, so reviewing the other messages might require an online translation] with "Subject: FTP + STMF + CCSID" <http://it.comp.as400.narkive.com/7nGnQ43I/ftp-stmf-ccsid>


Couple questions:

1. Just replacing all below ebcdic x'40' leaves a lot of strange
characters like x'8C' (sort of a moon with a hat..).

  That x'8C' is the character with identifier LD630000, "eth Icelandic Small";
• to see the proper glyph for that character identifier\name at code point 0x8C in Code Page 37:
<ftp://ftp.software.ibm.com/software/globalization/gcoc/attachments/CP00037.pdf>
• to see the character identifier (GCGID) with descriptive text (GCGID Name) in a text-only file [lacking a glyph] collated according to the CP 37:
<ftp://ftp.software.ibm.com/software/globalization/gcoc/attachments/CP00037.txt>
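A quick confirmation is possible with Python's cp037 codec [Python's name for CCSID 37]: code point x'8C' decodes to U+00F0, LATIN SMALL LETTER ETH, the "moon with a hat":

```python
# EBCDIC CP037 x'8C' is the Icelandic small eth.
eth = b"\x8c".decode("cp037")
print(eth)
```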


One thought is to identify all the characters we need and replace the
rest. No need to keep line and page formatting stuff. Is this a good
idea?


If data loss is not a concern, then /good enough/, I suppose; it is quite difficult for anyone from /outside/ to make that decision [for you], and any advice given from outside would be quite blind, perhaps even quite inappropriate, for not knowing the application and business requirements. But /data loss/ is, in my experience, rarely considered acceptable in a business application, given that data-loss issues can lead to anger and frustration and possibly even to client loss. As for anything translated to EBCDIC control characters [e.g. "line and page formatting stuff"], translating to blanks anything less than a blank is probably not too harmful, but it can be frustrating to anyone unable to see the original\intended formatting, especially if that /same data/ is later resent as ASCII [as converted from the EBCDIC data with all of the control characters since replaced with just spaces]; e.g. tabbed data can look spectacularly bad when the tabs are lost or replaced with non-tab characters.
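The "replace everything below x'40'" approach can be sketched as a one-pass byte translation [a hypothetical illustration, not a recommendation]; note the data loss described above, where the EBCDIC tab x'05' becomes an ordinary blank:

```python
# Blank out every EBCDIC code point below the space character x'40'.
BLANK = 0x40
TABLE = bytes(BLANK if i < BLANK else i for i in range(256))

def blank_controls(ebcdic_bytes: bytes) -> bytes:
    # Length is preserved; only control characters are altered.
    return ebcdic_bytes.translate(TABLE)
```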

But I wonder, does that statement ["identify and replace"] apply to the ASCII data coming in? And if so, does that perhaps already effectively happen, per the substitution character replacing those characters for which there is no EBCDIC representation when that ASCII data is stored as EBCDIC; if not per the screen I\O from the paste of copied data, then upon [implicit] conversion as input to the database file? Wondering, because most if not everything that is valid for EBCDIC in the CCSID 37 being utilized will be valid going into ASCII as any of 819, 1252, or UTF-8?

FWiW: The first SQL expression below replaces the x'3F' with the empty string; the second replaces it with an underscore character [which could just as well be a space, but a visible glyph left in place can be more revealing, and is considered a /significant/ character even at the end of a string]. With the underscore, all other character positions remain unaltered, so for example the string length remains the same [which is not the case for replacement by an empty string, and may not be the case for replacement by a blank *if* the last character is the one replaced]:

    replace( :the_string, x'3F', '' )

    replace( :the_string, x'3F', '_' )
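The same length-preservation point, sketched in Python on the ASCII side [where x'3F' has become the SUB control character '\x1a' after conversion]:

```python
s = "AB\x1aC"                      # '\x1a' is SUB, converted from EBCDIC x'3F'
removed  = s.replace("\x1a", "")   # string shrinks
replaced = s.replace("\x1a", "_")  # same length, other positions untouched
```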

FWiW: Another SQL expression that might be used to replace a known list of characters that are undesirable to be stored [including the substitution (SUB) character 0x3F]; see also "Subject: Strip list of characters from a field using SQL" <http://archive.midrange.com/rpg400-l/201208/msg00160.html>:

    replace( translate( :the_string, x'3F'
      , 'listofunwantedcharacters'
      , x'3F'         ) , x'3F', '' )
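A Python equivalent of that translate-then-replace pattern maps every unwanted character to nothing, deleting the whole list in one pass [the list shown is a hypothetical example; adjust per application]:

```python
# Delete a known list of unwanted characters in a single pass.
UNWANTED = "\x1a\t\r\n"            # example list only
DELETE_MAP = {ord(c): None for c in UNWANTED}

def strip_unwanted(text: str) -> str:
    return text.translate(DELETE_MAP)
```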


2. Thinking that since a multitude of entry/update points, db
triggers are best? Am wondering about apps that write the data, and
now after write, the screen column data is different than column data
in file (trigger pgm cleaned the data) - hoping to avoid opening up
all the apps.


Best overall might be choosing a more expansive data type\CCSID to be more accommodating as the recipient into which the data is written; i.e. change the database to accept UTF-8 [or UCS-2 or UTF-16] where SBCS EBCDIC is lacking. However, if the goal is to avoid making changes to the underlying database or to a specific TABLE [changes that are externally visible to the applications], then the TRIGGER seems an appropriate choice.

A CHECK CONSTRAINT, or an error signaled from the TRIGGER for that type of I\O, instead of implicit /cleansing/, might be more appropriate to ameliorate that concern about "the screen column data is different than column data". But then addressing every application that might encounter those errors, to ensure that in response all programs would invoke the required procedure to identify and\or correct the data in a manner compatible with the application, is probably going to be just as worrisome as the database data mismatching the screen data.
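The "reject rather than cleanse" idea can be sketched as a validation routine [a hypothetical stand-in for a CHECK CONSTRAINT or a trigger that signals an error] that raises instead of silently altering the data, so screen data can never silently diverge from stored data:

```python
# Example policy (an assumption, adjust per application): accept only
# EBCDIC code points at or above the blank x'40', which also excludes SUB x'3F'.
ALLOWED = set(range(0x40, 0x100))

def validate_record(ebcdic_bytes: bytes) -> None:
    # Raise instead of cleansing, mimicking a CHECK CONSTRAINT failure.
    bad = [i for i, b in enumerate(ebcdic_bytes) if b not in ALLOWED]
    if bad:
        raise ValueError(f"disallowed characters at positions {bad}")
```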

One might argue that the triggers could send a non-terminating message [e.g. an informational message; I am unsure of any capability to send either a Notify or a Status, as those are both Exception-class messages] to notify of the data change. But the applications would all have to know to /look for/ and handle that notification, such as to redisplay the updated data to confirm that what is displayed [in contrast to what was originally entered] serves as an acceptable alternative to the data originally attempted to be sent to the database. Triggers really need to be independent of the application and just do their thing; re-displaying a web page in response to a trigger that deems the data requires a prompted review is a quite different requirement from re-displaying a 5250 screen.

A better idea, in combination with a trigger that modifies the inputs, is to change the applications to use an effective SELECT FROM INSERT implementation [whether implemented as SQL or native] to compare the data from the database with what was in the buffer as what was intended to be written. That would be a more appropriate application change than having the application expect the trigger to communicate anything back *other than* an I\O error; i.e. again, the trigger should not try to communicate with the programs that might support disparate interfaces to an application.


3. How far do people with heavy edi take this? Am I leaving some
something out with the keyboard characters plus a few more? These are
names, addresses, notes (which are sometimes pages of notes).


I figured most transactions would have been transmitted in UTF-8, especially for any free-form data [such as /notes/ fields], and thus best stored as UTF-8 if such data were to be stored; names and addresses are nearly free-form as well, each likely to contain /words/ less likely to be representable in only the available EBCDIC characters. The much more structured data elements are often constrained in what characters can appear, e.g. numeric-only with limited formatting, or codes limited to EUC and digits, so those are more easily stored as EBCDIC [without any data-loss issues] because there is going to be a one-to-one correlation for each character between EBCDIC and ASCII [aka a valid round trip for each character conversion].
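That round-trip criterion can be sketched as a simple check [hypothetical helper name]: a value is safe to store as CCSID 37 only if Unicode -> CCSID 37 -> Unicode returns it intact:

```python
# Does this text survive a round trip through EBCDIC CCSID 37?
def round_trips_to_ccsid37(text: str) -> bool:
    try:
        return text.encode("cp037").decode("cp037") == text
    except UnicodeEncodeError:
        return False
```

Typical structured name-and-address data passes; characters outside the code page, such as Greek letters, do not.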


