Jim,

In file transfer situations, I would never trust the CCSID file attribute (unless you've already made sure that it's right, of course).
Unless you're transferring a save file from another IBM i system/partition, the CCSID is not part of what gets transferred. All that's transferred is the data itself. The system will usually just assign a 'default' CCSID -- it has no way of knowing if it's the right one for your data. It expects you to change it accordingly if your data is different.
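If you are finding that a single character (such as a "smart quote" or international symbol) is showing up as two or more bytes of data, resulting in extra 'garbage' when translated to EBCDIC, this almost always means that the data is UTF-8, but you're telling the system that it's a single-byte ASCII code page (such as 819). The basic alphabet and numbers will translate correctly, since UTF-8 encodes them the same way ASCII does, but more 'special' characters will be mistranslated: each one is stored in UTF-8 as two or three bytes (an accented letter takes two, a smart quote takes three), and every one of those bytes gets translated as if it were a character of its own.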
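To make that concrete, here is a small sketch in Python (not RPG, simply because it is easy to try anywhere). It assumes Python's "latin-1" codec as a stand-in for CCSID 819; the sample text and the looks_like_utf8 helper are made up for illustration, and the check is only a heuristic, not proof of what the sender intended.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if data contains non-ASCII bytes AND decodes as strict UTF-8."""
    if data.isascii():
        # Pure ASCII is byte-for-byte identical in 819, 1252 and UTF-8,
        # so it tells us nothing either way.
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical payload: "café" followed by a right-hand smart quote, sent as UTF-8.
received = "caf\u00e9 \u2019".encode("utf-8")

# Reading those bytes as ISO 8859-1 (CCSID 819) turns every byte into its own
# character, so the 6-character text comes out as 9 characters.
misread = received.decode("latin-1")

# Reading them as what they really are gives the original text back.
correct = received.decode("utf-8")

print(len(correct), len(misread), looks_like_utf8(received))
```

Data that really is 819 or 1252 and contains accented letters will nearly always fail the strict UTF-8 decode, which is why a check along these lines tends to be a useful first look at an incoming file.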
Really, considering that it's 2015, we should all be using Unicode (UTF-8 or UTF-16) for as much as possible. ASCII and EBCDIC are really cumbersome. I know it's hard when you have so many applications that are already in EBCDIC, but an all-Unicode environment is really what you should be striving for in the long run, even if you can't do it today.
Anyway -- how to "purify" the data. There are certain commonplace fixes that make sense to do, such as replacing "smart quotes" with straight quotes. I would definitely do this in Unicode (or ASCII if that's what it is) before translating to EBCDIC.
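A rough sketch of that order of operations, again in Python for illustration only. It assumes Python's "cp037" codec as a stand-in for CCSID 37; the SMART_TO_PLAIN table and the to_ebcdic name are hypothetical, and the table only covers the four common curly quotes.

```python
# Hypothetical replacement table: curly quotes -> straight quotes.
SMART_TO_PLAIN = str.maketrans({
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote / apostrophe
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
})


def to_ebcdic(text: str) -> bytes:
    """Normalize quotes while still in Unicode, then translate to EBCDIC."""
    plain = text.translate(SMART_TO_PLAIN)
    # Strict encoding: anything the code page cannot represent raises
    # UnicodeEncodeError instead of silently turning into a substitute.
    return plain.encode("cp037")


ebcdic = to_ebcdic("\u201cIt\u2019s fine\u201d")
print(ebcdic.decode("cp037"))
```

Doing the replacement first matters because the target code page has no curly quotes at all; once the text has been translated, the information about which character it originally was is gone.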
But aside from these common things, it's generally ugly and nasty to remove "unwanted" characters. There's no good way to do this, since there's really no way the computer knows which characters are "allowed" and which are not. How does it know whether a half-moon character, for example, is intentional or whether it's an error? The same is true of accented characters -- oftentimes people (at least in the USA) will see these and say they are "garbage" -- but they are normal parts of human languages in most of the world. How can the computer know that they are "garbage"? Obviously, it's easy for us as human beings to look at the data and realize that a particular character doesn't belong there -- but I'm sure you understand that a computer can't see things that way.
So I guess if you want to "purify" your data, the BEST way to do that is to find out where these unwanted characters are coming from, and have it stop sending them. If you really, truly, can't do that then the "hack" would be to make a list of everything you DO want, and remove everything else. What is/isn't a wanted character will almost certainly vary from application to application, so there isn't really any built-in way to do this. Just make a string of all the characters you want, and use RPG operations like %CHECK to find the ones not in that character set and remove them. But, this really is a hack...
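For what it's worth, the "list of everything you DO want" hack looks roughly like this when sketched in Python (illustration only, not RPG). The ALLOWED set here is a made-up example; the real list depends entirely on the application. first_disallowed is a hypothetical helper that mimics what %CHECK reports: the 1-based position of the first character not in the list, or 0 if there is none.

```python
import string

# Hypothetical allow-list: US keyboard characters plus a couple of extras.
ALLOWED = set(
    string.ascii_letters + string.digits + string.punctuation + " " + "\u00a9\u00a2"
)


def first_disallowed(text: str, allowed=ALLOWED) -> int:
    """1-based position of the first character not in allowed, 0 if none."""
    for pos, ch in enumerate(text, start=1):
        if ch not in allowed:
            return pos
    return 0


def keep_allowed(text: str, allowed=ALLOWED, replacement: str = "") -> str:
    """Drop (or replace) every character that is not on the allow-list."""
    return "".join(ch if ch in allowed else replacement for ch in text)


sample = "Caf\u00e9 \u00a9 2015\x0c"   # accented e, copyright sign, form feed
print(first_disallowed(sample), repr(keep_allowed(sample)))
```

Note how the accented letter gets thrown away along with the form feed. That is exactly the weakness of this approach: the list, not the data, decides what counts as "garbage".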
On 5/8/2015 1:33 PM, Jim Franz wrote:
without asking every entity, can one tell looking at the file attributes?

Jim

On Fri, May 8, 2015 at 2:28 PM, Henrik Rützou <hr@xxxxxxxxxxxx> wrote:

Jim

even if the files you receive are in CCSID 819/1252, are you sure that they aren't UTF-8 files?

On Fri, May 8, 2015 at 8:25 PM, Jim Franz <franz9000@xxxxxxxxx> wrote:

EBCDIC CCSID = 37

Most file imports are via ftp - ccsid 1252, occasionally burned dvd for new customer startup of history. Some trading partners are mainframe, some unix/Linux, some Win, all US based entities, but we think some servers are overseas (we see time differences). When we write ascii text, usually 819.

What hurts us most is screen input (web interface to SQL Server then to Power i) where user cuts & pastes paragraphs of text from their source systems (thousands of different customers).

Jim

On Fri, May 8, 2015 at 2:07 PM, Henrik Rützou <hr@xxxxxxxxxxxx> wrote:

Jim

what is the EBCDIC CCSID on your machine and how do you receive files?

On Fri, May 8, 2015 at 8:00 PM, Jim Franz <franz9000@xxxxxxxxx> wrote:

We do a lot of import and export of data, plus have both PC client (local and web) input as well as PC5250.

Had a recent thread involving cut and paste data (ebcdic x'3F') that caused an issue. We use CCSID 37 and ascii 819.

There are more EBCDIC characters than what we see on the US Keyboard. Some we need, such as copyright symbol, cents sign, etc, but many

We are wanting to take steps to clean the data on input, whether from ascii or ebcdic side. We have some input already cleansed, but only at screen program level.

Couple questions:

1. Just replacing all below ebcdic x'40' leaves a lot of strange characters like x'8C' (sort of a moon with a hat..). One thought is to identify all the characters we need and replace the rest. No need to keep line and page formatting stuff. Is this a good idea?

2. Thinking that since a multitude of entry/update points, db triggers are best? Am wondering about apps that write the data, and now after write, the screen column data is different than column data in file (trigger pgm cleaned the data) - hoping to avoid opening up all the apps.

3. How far do people with heavy edi take this? Am I leaving something out with the keyboard characters plus a few more? These are names, addresses, notes (which are sometimes pages of notes).

Jim Franz

--
This is the Midrange Systems Technical Discussion (MIDRANGE-L) mailing list
To post a message email: MIDRANGE-L@xxxxxxxxxxxx
To subscribe, unsubscribe, or change list options,
visit: http://lists.midrange.com/mailman/listinfo/midrange-l
or email: MIDRANGE-L-request@xxxxxxxxxxxx
Before posting, please take a moment to review the archives
at http://archive.midrange.com/midrange-l.

--
Regards,
Henrik Rützou

http://powerEXT.com
This mailing list archive is Copyright 1997-2025 by midrange.com and David Gibbs as a compilation work. Use of the archive is restricted to research of a business or technical nature. Any other uses are prohibited.