UTF-8 and UTF-16 MIGHT have what is called a BOM, or byte order mark.
For UTF-16 it is either FE FF or FF FE; the former means big-endian, the
latter means little-endian. So the letter "t" is x'7400' in
little-endian and x'0074' in big-endian - at least, that's how TextPad
just saved a couple of test text files. Big-endian means, to me, that
the bytes are in the order we would read them, left to right, most
significant byte first; little-endian has the order flipped. Can anyone
confirm that some machine code stuff is done better with little-endian?
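
For anyone who wants to check this away from TextPad, here is a minimal Python sketch of the UTF-16 BOM test described above. The function name sniff_utf16_bom is made up for illustration, and the sketch deliberately ignores UTF-32, whose little-endian BOM also begins with FF FE.

```python
import codecs

def sniff_utf16_bom(data: bytes):
    """Return 'utf-16-be' or 'utf-16-le' if data starts with a UTF-16 BOM, else None.

    Hypothetical helper for illustration; it does not try to rule out UTF-32.
    """
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "utf-16-le"
    return None

# The letter "t" in both byte orders, without a BOM
print("t".encode("utf-16-le").hex())  # 7400
print("t".encode("utf-16-be").hex())  # 0074
```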
Now, a BOM is not required. I suppose one can identify which flavor of
UTF-16 you have by checking whether the nulls fall in odd or even
positions (counting from 1) - even is LE, odd is BE. That only works
when the text is mostly ASCII-range characters, and I'm not sure how
anyone else, such as Notepad++, does it.
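
The null-position idea can be sketched in a few lines of Python. This is only my guess at a workable heuristic, not how Notepad++ or any other editor actually does it, and guess_utf16_byte_order is a made-up name. It assumes the data really is UTF-16 and mostly ASCII-range text.

```python
def guess_utf16_byte_order(data: bytes):
    """Guess 'utf-16-le' or 'utf-16-be' from where the zero bytes fall.

    Hypothetical heuristic: returns None when the evidence is balanced or absent.
    """
    # Python offsets are 0-based: offset 0 is the 1st (odd) position,
    # offset 1 is the 2nd (even) position of each two-byte unit.
    zeros_in_odd_positions = data[0::2].count(0)   # zero byte first  -> big-endian
    zeros_in_even_positions = data[1::2].count(0)  # zero byte second -> little-endian
    if zeros_in_even_positions > zeros_in_odd_positions:
        return "utf-16-le"
    if zeros_in_odd_positions > zeros_in_even_positions:
        return "utf-16-be"
    return None
```

Text that is mostly outside the ASCII range has few or no zero bytes, so the function will often return None for it, which is the honest answer.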
But UTF-8 is another creature - it doesn't have endian flavors but does
have a BOM, EF BB BF. It also is not required, and if it's absent, you
have a real guessing game on your hands. On the i, what I did was try to
copy the text file to one with CCSID 1208 - if that succeeded, I
considered the contents to be UTF-8. Not great but mostly useful. There
ARE certain byte sequences that pretty surely mean the contents are
UTF-8. Of course, the first 128 code points are single-byte and
identical to 7-bit ASCII, so a file with only those looks the same
either way.
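
Off the i, the same "try it and see if it fails" test can be sketched in Python with a strict decode. This is an analogue of the copy to CCSID 1208, not the same mechanism, and both function names are made up for illustration.

```python
import codecs

def looks_like_utf8(data: bytes) -> bool:
    """True if data decodes cleanly as UTF-8; a leading BOM is accepted but not required."""
    if data.startswith(codecs.BOM_UTF8):   # EF BB BF
        data = data[len(codecs.BOM_UTF8):]
    try:
        data.decode("utf-8")               # strict mode raises on any invalid sequence
        return True
    except UnicodeDecodeError:
        return False

def has_utf8_evidence(data: bytes) -> bool:
    """True if data is valid UTF-8 AND contains bytes outside the ASCII range.

    Pure ASCII is valid UTF-8 too, but it proves nothing about the encoding.
    A BOM counts as evidence here, since its bytes are all above 0x7F.
    """
    return looks_like_utf8(data) and any(b >= 0x80 for b in data)
```

A file that passes looks_like_utf8 but fails has_utf8_evidence is the jam jar with a label you can't check: it could be UTF-8, ASCII, or any of the single-byte code pages that agree with ASCII on those characters.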
Probably TMI for a Wednesday morning just before April Fool's Day!
Good luck getting a straight answer tomorrow!
Patrik, by the flag in metadata, do you mean the CCSID or the code page?
Other systems don't use those at all; there's basically no metadata that
I know of in text files. Now PK-ZIP files have PK in the first 2 bytes,
and other file types have similar markings. But not text files.
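
To show the PK marker in practice, here is a small Python sketch that builds a ZIP archive in memory and looks at its first two bytes. The helper name starts_with_pk and the archive contents are made up for the example.

```python
import io
import zipfile

def starts_with_pk(data: bytes) -> bool:
    """True if the first two bytes are the ASCII letters 'PK'."""
    return data[:2] == b"PK"

# Build a tiny ZIP archive in memory so there is something real to inspect
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("hello.txt", "just some text")
zip_bytes = buffer.getvalue()

print(starts_with_pk(zip_bytes))          # the archive carries the marker
print(starts_with_pk(b"just some text"))  # plain text has nothing comparable
```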
Regards
Vern
On 3/31/2021 9:22 AM, Patrik Schindler wrote:
Hello Dave,
On 31.03.2021 at 12:27, Dave <dfx1@xxxxxxxxxxxxxx> wrote:
I've managed to copy the data in a readable form.
Glad you found a solution.
Thanks to Rob and Patrik, also an interesting analogy I read on jam and jam
jar labels that gave me a vital clue (just because it said CCSID 850 on the
label does not guarantee that it is CCSID 850)!
Exactly. Notepad++ looks *inside* and makes guesses about the charset. If you have a file without any umlauts or accented characters, it will surely tell you the content is supposed to be US-ASCII.
IBM i and its predecessors have a flag in the file's metadata. But this is also no guarantee that the designation is correct if something went wrong when the file was uploaded to IBM i.
It's a bit like renaming a .jpg file to .doc. The OS thinks it's a Word file, but Word (probably) can't read the binary JPEG format.
It comes down to finding a way to make the text inside the file and the metadata outside it consistent.
:wq! PoC