Re: Problem copying PC document to DDS file with accentuated characters -- MIDRANGE-L

UTF-8 and UTF-16 MIGHT have what is called a BOM, or byte order mark. For UTF-16 it is either FE FF or FF FE; the former means big-endian, the latter little-endian. So the letter "t" is x'7400' in little-endian and x'0074' in big-endian - at least, that's how TextPad just saved a couple of test text files. Big-endian means, to me, that the bytes are in the order we would read them, left to right; little-endian has the order flipped. Can anyone confirm that some machine code stuff is done better with little-endian?
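To make the BOM values concrete, here is a minimal Python sketch that checks the first bytes of some data against the three marks mentioned in this thread. The function name sniff_bom is made up for illustration, and the sketch deliberately ignores UTF-32, whose little-endian BOM also begins with FF FE.

```python
import codecs

# The byte order marks discussed above, paired with a Python codec name.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),      # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


if __name__ == "__main__":
    # The letter "t" in both byte orders, shown as hex.
    print("t".encode("utf-16-le").hex())  # 7400
    print("t".encode("utf-16-be").hex())  # 0074
    print(sniff_bom(codecs.BOM_UTF16_LE + "t".encode("utf-16-le")))
```

A result of None only says that no BOM is present; it says nothing about which encoding the data actually uses.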

Now it is not required to use a BOM. I suppose one can identify which flavor of UTF-16 you have by checking whether the nulls fall in odd or even positions, counting from 1: even is LE, odd is BE. That only works when the text is mostly plain Latin characters, where one byte of each pair is x'00'. I'm not sure how anyone else, such as Notepad++, does that.
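The null-position idea can be sketched in a few lines of Python. This is only the rough heuristic described above, not how Notepad++ or any other editor is known to do it, and the function name guess_utf16_order is hypothetical.

```python
def guess_utf16_order(data: bytes):
    """Guess UTF-16 byte order from where the zero bytes sit.

    Positions are counted from 1: zeros mostly in even positions suggest
    little-endian, zeros mostly in odd positions suggest big-endian.
    Returns "utf-16-le", "utf-16-be", or None when there is no clear winner.
    """
    odd_zeros = data[0::2].count(0)   # positions 1, 3, 5, ...
    even_zeros = data[1::2].count(0)  # positions 2, 4, 6, ...
    if even_zeros > odd_zeros:
        return "utf-16-le"
    if odd_zeros > even_zeros:
        return "utf-16-be"
    return None


if __name__ == "__main__":
    print(guess_utf16_order("test".encode("utf-16-le")))
    print(guess_utf16_order("test".encode("utf-16-be")))
    print(guess_utf16_order(b"test"))  # no zero bytes at all
```

Text outside the Latin range has few or no zero bytes, so the guess gets weaker the further the content strays from plain ASCII.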

But UTF-8 is another creature - it doesn't have endian flavors but does have a BOM, EF BB BF. It also is not required, and if it's absent, you have a real guessing game on your hands. On the i, what I did was try to copy the text file to one with CCSID 1208 - if that succeeded, I considered the contents to be UTF-8. Not great but mostly useful. There ARE certain byte sequences that, I will say, pretty reliably mean the contents are UTF-8. And of course the first 128 code points are single-byte and identical to US-ASCII, so a file with nothing beyond those looks the same either way.
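Off the i, the "try it as UTF-8 and see if it works" test can be approximated with a strict decode. This Python sketch is an analogy to the CCSID 1208 copy, not the same operation, and both function names are invented for illustration.

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if data decodes as strict UTF-8. Pure ASCII also passes."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def has_multibyte_utf8(data: bytes) -> bool:
    """True if data is valid UTF-8 and contains at least one non-ASCII byte."""
    return looks_like_utf8(data) and any(b >= 0x80 for b in data)


if __name__ == "__main__":
    print(looks_like_utf8("café".encode("utf-8")))   # valid multi-byte sequence
    print(looks_like_utf8("café".encode("cp850")))   # lone x'82' is not valid UTF-8
    print(has_multibyte_utf8(b"cafe"))               # ASCII only, nothing to go on
```

The second function is the stronger signal: valid multi-byte sequences rarely turn up by accident in single-byte code pages, while ASCII-only data passes the first test no matter what it was meant to be.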

Probably TMI for a Wednesday morning just before April Fool's Day! Good luck getting a straight answer tomorrow!

Patrik, by the flag in metadata, do you mean the CCSID or the code page? Other systems don't use those at all; there's basically no metadata that I know of in text files. Now PK-ZIP files, they have PK in the first 2 bytes, and other file types have similar markings. But not text files.

Regards
Vern

On 3/31/2021 9:22 AM, Patrik Schindler wrote:

Hello Dave,

On 31.03.2021 at 12:27, Dave <dfx1@xxxxxxxxxxxxxx> wrote:

I've managed to copy the data in a readable form.

Glad you found a solution.

Thanks to Rob and Patrik, also an interesting analogy I read on jam and jam
jar labels that gave me a vital clue (just because it said CCSID 850 on the
label does not guarantee that it is CCSID 850)!

Exactly. Notepad++ looks *inside* and makes guesses about the charset. If you have a file without any umlauts or accented characters, it surely will tell you the content is supposed to be US-ASCII.

IBM i and its predecessors have a flag in the file's metadata. But this is no guarantee that the flag holds the correct designation if something went wrong when uploading the file to IBM i.
It's a bit the same as when you rename a .jpg file to .doc. The OS thinks it's a Word file, but Word (probably) can't read the binary JPEG format.

It comes down to finding a way to make the text inside the file and the metadata outside it consistent.

:wq! PoC