Chuck, I appreciate the full detail. We are reviewing our options. We
take in data from many sources. Some of the imports and exports have run
15+ years with little change. What has changed is the front end, both
ours and our trading partners'. This will take a little time to digest
and implement a change.
Jim

On Mon, May 11, 2015 at 12:38 PM, CRPence <crpbottle@xxxxxxxxx> wrote:

> On 08-May-2015 13:00 -0500, Jim Franz wrote:
>
>> We do a lot of import and export of data, plus have both PC client
>> (local and web) input as well as PC5250. Had a recent thread involving
>> cut-and-paste data (EBCDIC x'3F') that caused an issue. We use CCSID
>> 37 and ASCII 819.
>>
>
> If data comes from the web [or other ASCII sources], the best choice is
> likely to receive and store the data as UTF-8, because that is probably
> what the data coming from the web\elsewhere already is. One might
> expect that the stored EBCDIC data is already being converted to UTF-8
> for any transmittal, so storing it as UTF-8 would eliminate that
> conversion activity [from EBCDIC].
>
>> There are more EBCDIC characters than what we see on the US keyboard.
>> Some we need, such as the copyright symbol, cents sign, etc., but many
>>
>
> "but many..." ¿other characters are never utilized, and thus we don't
> really care about or /need/ to store\transmit such characters?
>
> If the EBCDIC characters are transmitted as UTF-8, then if they are
> used they would be recognized; there would be little reason to limit
> the EBCDIC characters for that reason. If that EBCDIC data is
> transferred as EBCDIC across multiple EBCDIC CCSIDs, then choosing to
> minimize the number of variant characters would reduce the chances for
> the type of /data loss/ whereby the code points no longer match the
> expected characters. As long as the stored data has either a visible
> glyph or an intended effect [such as tab, CRLF, etc.]
> that enables someone to see\review the data, then there is little
> reason to be concerned with how [un]popular or how [un]needed each
> character is, except perhaps for compression for storage purposes, if
> the data is only being stored and\or is sent out as UTF-8.
>
>> We are wanting to take steps to clean the data on input, whether from
>> the ASCII or EBCDIC side. We have some input already cleansed, but
>> only at the screen-program level.
>>
>
> IIRC the 0x3F issue originally discussed was the result of copying from
> web page data [or other ASCII document source] that was later pasted
> onto a 5250 screen. In that case I believe the 5250 program could scan
> for those SUB characters to address the issue directly, because they
> got there before the data was ever sent to the database.
>
> FWiW, as some possible added value, here is a link to my reply
> [probably see the entire thread] in a topic a couple of years ago with
> "Subject: Can you select replacement character in translation from
> UNICODE (UTF-16)"
> <http://archive.midrange.com/midrange-l/201304/msg00985.html>
> and another I had replied to twice in the following discussion [in
> Italian, so reviewing the other messages might require an online
> translation] with "Subject: FTP + STMF + CCSID"
> <http://it.comp.as400.narkive.com/7nGnQ43I/ftp-stmf-ccsid>
>
>> Couple questions:
>>
>> 1. Just replacing all below EBCDIC x'40' leaves a lot of strange
>> characters like x'8C' (sort of a moon with a hat..).
>>
>
> That x'8C' is the character Id LD630000 "eth Icelandic Small";
> • to see the proper glyph with that character identifier\name with the
> code point 0x8C in Code Page 37:
> <ftp://ftp.software.ibm.com/software/globalization/gcoc/attachments/CP00037.pdf>
> • to see the character identifier (GCGID) with descriptive text (GCGID
> Name) in a text-only file [lacking a glyph] collated according to CP 37:
> <ftp://ftp.software.ibm.com/software/globalization/gcoc/attachments/CP00037.txt>
>
>> One thought is to identify all the characters we need and replace the
>> rest. No need to keep line and page formatting stuff. Is this a good
>> idea?
>>
>
> If data-loss is not a concern, then /good enough/ I suppose; it is
> quite difficult for anyone from /outside/ to make that decision [for
> you], and any advice given from outside would be quite blind and
> perhaps even quite inappropriate for not knowing the application and
> business requirements... but /data loss/ is, in my experience, rarely
> considered acceptable in a business application, given that data-loss
> issues can lead to anger and frustration and possibly even to
> client-loss. As for anything translated to EBCDIC control characters
> [e.g. "line and page formatting stuff"], translating to blanks
> anything less than a blank is probably not too harmful, but possibly
> frustrating to anyone unable to see the original\intended formatting,
> esp. if that /same data/ was later resent as ASCII data [as converted
> from the EBCDIC data with all of the control characters since replaced
> with just spaces]; e.g. tabbed data can look spectacularly bad when
> the tabs are lost or replaced with non-tab characters.
>
> But I wonder, does that statement ["identify and replace"] apply to
> the ASCII data coming in?
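FWiW, that x'8C' mapping is easy to confirm mechanically; here is a small sketch in Python [not from the original thread; Python's "cp037" codec stands in for EBCDIC CCSID\Code Page 37 here] decoding the single byte to its Unicode character:

```python
# Decode the single EBCDIC (Code Page 37) byte x'8C' to Unicode.
# Python's "cp037" codec implements IBM Code Page 37.
eth = bytes([0x8C]).decode("cp037")
print(eth)            # the "eth Icelandic Small" letter
print(hex(ord(eth)))  # 0xf0, i.e. U+00F0 LATIN SMALL LETTER ETH
```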
> And if so, does that perhaps already effectively happen, per the
> substitution character replacing those characters for which there is
> no EBCDIC character representation, when that ASCII data is stored as
> EBCDIC; if not per the screen I\O from the paste of copied data, then
> upon [implicit] conversion as input to the database file? Wondering,
> because most [if not all] characters that are valid EBCDIC in the
> CCSID 37 being utilized will be valid going into ASCII for any of 819,
> 1252, or UTF-8.
>
> FWiW: The following SQL expression replaces the x'3F' with the empty
> string, and the next expression replaces it with an underscore
> character [it could just as well be a space character, but a visual
> glyph left in place can be more revealing, and is considered a
> /significant/ character even at the end of a string], for which all
> other character positions remain unaltered; so, for example, the
> string length remains the same [which is not the case for replace by
> an empty string, and may not be the case for replace by a blank *if*
> the last character is replaced]:
>
>   replace( :the_string, x'3F', '' )
>
>   replace( :the_string, x'3F', '_' )
>
> FWiW: Another SQL expression that might be used to replace a known
> list of characters that are undesirable to be stored [including the
> substitution (SUB) character 0x3F]; see also "Subject: Strip list of
> characters from a field using SQL"
> <http://archive.midrange.com/rpg400-l/201208/msg00160.html>:
>
>   replace( translate( :the_string, x'3F'
>                     , 'listofunwantedcharacters'
>                     , x'3F' ) , x'3F', '' )
>
>> 2. Thinking that since [there are] a multitude of entry/update
>> points, db triggers are best? Am wondering about apps that write the
>> data, and now after write, the screen column data is different than
>> column data in file (trigger pgm cleaned the data) - hoping to avoid
>> opening up all the apps.
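FWiW, the strip-a-list-of-characters logic in the SQL expressions above can be mirrored outside SQL; here is a small Python sketch [the function names and sample strings are illustrative, not from the thread; U+001A is what the EBCDIC SUB x'3F' becomes once decoded from CCSID 37] using str.translate in place of the TRANSLATE\REPLACE pair:

```python
SUB = "\x1a"  # EBCDIC SUB x'3F' maps to U+001A when decoded from cp037

def strip_unwanted(s: str, unwanted: str) -> str:
    # Mirror of: replace(translate(s, x'3F', unwanted, x'3F'), x'3F', '')
    # i.e. delete every unwanted character, and SUB itself.
    return s.translate({ord(c): None for c in unwanted + SUB})

def mark_unwanted(s: str, unwanted: str, mark: str = "_") -> str:
    # Variant mirroring replace(..., '_'): the string length is kept,
    # and the visible glyph reveals where a character was replaced.
    return s.translate({ord(c): mark for c in unwanted + SUB})

print(strip_unwanted("ab\x1acd", ""))  # abcd
print(mark_unwanted("ab\x1acd", ""))   # ab_cd
```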
>>
>
> Best overall might be choosing a more expansive data type\CCSID to be
> more accommodating as the recipient into which the data is written;
> i.e. change the database to accept UTF-8 [or UCS-2 or UTF-16] where
> SBCS EBCDIC is lacking. However, if what might be _best_ is an attempt
> to avoid making changes to the underlying database or a specific TABLE
> [changes that are externally visible to the applications], then the
> TRIGGER seems an appropriate choice.
>
> A CHECK CONSTRAINT, or an error manifest from the TRIGGER for that
> type of I\O, instead of implicit /cleansing/, might be more
> appropriate to ameliorate the concern that "the screen column data is
> different than column data". But then addressing every application
> that might encounter those errors, to ensure that in response all
> programs would invoke the required procedure to identify and\or
> correct the data in a manner compatible with the application, is
> probably going to be just as worrisome as the database data
> mismatching the screen data.
>
> One might argue that the triggers could send up a non-terminating
> message [e.g. an informational message; I am unsure of any capability
> to send either of Notify or Status, but those are both Exception-class
> messages] to notify of the data change, but the applications would all
> have to know to /look for/ and to handle that notification, such as to
> redisplay the updated data to confirm that what is displayed [in
> contrast to what was originally entered] serves as an acceptable
> alternative to the data originally attempted to be sent to the
> database. But triggers really need to be independent of the
> application and just do their thing; there are quite different
> requirements to try to re-display a web page vs a 5250 screen in
> response to a trigger that deems the data requires a prompted review
> to confirm.
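FWiW, the reject-rather-than-cleanse approach amounts to a simple validation at each write; here is a Python sketch [the allowed repertoire shown is purely illustrative, not from the thread] of what a CHECK CONSTRAINT would effectively enforce, leaving the fix-and-retry to the application instead of a trigger silently altering the row:

```python
import string

# Illustrative allowed repertoire; a real list would come from the
# business requirements, not from this sketch.
ALLOWED = set(string.ascii_letters + string.digits +
              string.punctuation + " ")

def is_clean(s: str) -> bool:
    # CHECK CONSTRAINT-style test: True only when every character is in
    # the allowed set, so the stored data always matches what was sent.
    return all(c in ALLOWED for c in s)

print(is_clean("Main St. #4"))  # True
print(is_clean("bad\x1adata"))  # False: contains a SUB character
```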
>
> A better idea, in combination with a trigger that modifies the inputs,
> is to change the applications to use an effective SELECT FROM INSERT
> implementation [whether implemented as SQL or native] to compare the
> data from the database with what was in the buffer as what was
> intended to be written. That would be more appropriate as a change to
> the application than the application having any expectation of what
> the trigger should communicate back to it *other than* an I\O error;
> i.e. again, the trigger should not try to communicate with the
> programs that might support disparate interfaces to an application.
>
>> 3. How far do people with heavy edi take this? Am I leaving something
>> out with the keyboard characters plus a few more? These are names,
>> addresses, notes (which are sometimes pages of notes).
>>
>
> I figured most transactions would have been transmitted in UTF-8,
> especially for any free-form data [such as /notes/ fields], and thus
> best stored as UTF-8 if such data were to be stored; names and
> addresses are nearly free-form, per each being likely to contain
> /words/ that are less likely able to be represented in only the
> available EBCDIC characters. The much more structured data elements
> often are constrained by what characters can appear, e.g. numeric-only
> with limited formatting, or codes limited to EUC and digits, so those
> are more likely to be easily stored as EBCDIC [without any data-loss
> issues], because there is going to be a one-to-one correlation for
> each character between EBCDIC and ASCII [aka having a valid round-trip
> for each character conversion].
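FWiW, that one-to-one\round-trip claim can be verified mechanically; here is a Python sketch [not from the thread; the "cp037" and "latin-1" codecs stand in for CCSID 37 and ASCII 819] checking that every CCSID 37 code point survives a trip through 819 and back unchanged:

```python
# Verify a valid round-trip: every EBCDIC Code Page 37 byte decodes to
# a character that 819 (ISO 8859-1, "latin-1") can also represent, and
# converting back yields the original byte -- i.e. no data loss.
round_trips = all(
    bytes([b]).decode("cp037").encode("latin-1")
              .decode("latin-1").encode("cp037") == bytes([b])
    for b in range(256)
)
print(round_trips)  # True
```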
>
> --
> Regards, Chuck
>
> --
> This is the Midrange Systems Technical Discussion (MIDRANGE-L) mailing
> list.
> To post a message, email: MIDRANGE-L@xxxxxxxxxxxx
> To subscribe, unsubscribe, or change list options,
> visit: http://lists.midrange.com/mailman/listinfo/midrange-l
> or email: MIDRANGE-L-request@xxxxxxxxxxxx
> Before posting, please take a moment to review the archives
> at http://archive.midrange.com/midrange-l.