Thanks Simon. I didn't realize u-umlaut takes 2 bytes in UTF-8.
I tested the example using a German setup (DEU/DE/273) and both built-ins
returned 'Jü', so I'm still convinced that both built-ins work as designed.
It would have been helpful if IBM had documented which CCSID they were in
when they INSERTed the data into that UTF-8 field, as well as which CCSID
they were using when they ran the query and displayed the results. All of
these settings matter in this case.
And in typical IBM fashion they don't advise any best practices (where to
use LEFT vs SUBSTR or vice versa).
Elvis
Celebrating 11-Years of SQL Performance Excellence on IBM i5/OS and OS/400
www.centerfieldtechnology.com
-----Original Message-----
Subject: Re: LEFT vs SUBSTR
From the InfoCentre under SUBSTR:
"1 The SUBSTR function accepts mixed data strings. However, because
SUBSTR operates on a strict byte-count basis, the result will not
necessarily be a properly formed mixed data string."
What you see is expected behaviour.
LEFT operates on characters. You only specify the number of them.
LEFT works out where they start and end, so it implicitly
handles multi-byte characters.
SUBSTR operates on bytes. You specify the starting position and the
length in bytes, so if the length stops in the middle of a
multi-byte character you will get crap returned.
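
To make the difference concrete, here is a rough sketch in Python (not SQL, and not run on DB2) that models the two behaviours described above. The value "Jürgen" and the helper names left_chars and substr_bytes are made up for illustration. The real built-ins also depend on the CCSIDs involved, so treat this as a model of the byte-vs-character idea only.

```python
# Hypothetical sample value; the data used in the thread is not shown.
name = "Jürgen"
utf8 = name.encode("utf-8")  # b'J\xc3\xbcrgen' (ü takes two bytes)


def left_chars(s: str, n: int) -> str:
    """Character-based: return the first n characters (models LEFT as described)."""
    return s[:n]


def substr_bytes(data: bytes, start: int, length: int) -> bytes:
    """Byte-based with a 1-based start (models SUBSTR as described)."""
    return data[start - 1:start - 1 + length]


# Counting characters keeps the ü whole.
print(left_chars(name, 2))

# Counting bytes stops in the middle of the two-byte ü.
cut = substr_bytes(utf8, 1, 2)
print(cut)
print(cut.decode("utf-8", errors="replace"))  # the broken byte shows as U+FFFD

# Asking for three bytes happens to land on a character boundary.
print(substr_bytes(utf8, 1, 3).decode("utf-8"))
```

With a single-byte CCSID the two approaches would return the same thing, which may explain why both built-ins can appear to agree in some setups.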
Although u-umlaut appears to be a single character, in UTF-8 it is
represented as multiple bytes.
CCSID 37 ü x'DC'
CCSID 819 ü x'FC'
CCSID 1208 ü x'C3BC'
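
The byte values above can be cross-checked off the box with a few lines of Python. This is only a sketch: it assumes Python's cp037, latin-1 and utf-8 codecs correspond to CCSIDs 37, 819 and 1208 for this character, and the labels are just for display.

```python
u_umlaut = "ü"

# Assumed mapping from CCSID to Python codec name.
encodings = {
    "CCSID 37": "cp037",
    "CCSID 819": "latin-1",
    "CCSID 1208": "utf-8",
}

# Encode the same character under each codec and keep the raw bytes.
byte_table = {label: u_umlaut.encode(codec) for label, codec in encodings.items()}

for label, raw in byte_table.items():
    print(f"{label}: x'{raw.hex().upper()}' ({len(raw)} byte(s))")
```

Only the UTF-8 encoding is longer than one byte, which is why a byte-count cut can split the character there but not in the other two.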
Clear?
Mind you, I don't believe the E-acute is correct.