Re: REGEXP_INSTR range -- MIDRANGE-L

The extract:

*[A-M] Range - match any character from A to M. The characters to
include are determined by Unicode code point ordering.
[\u0000-\U0010ffff] Range - match all characters.*

is a definite warning to EBCDIC users (and certainly caused me no lack of
questions on actual implementation with ranges, which is pretty much why I
dropped out of the conversation earlier). In Unicode, A-M are sequential
code points (as in ASCII), but in EBCDIC they are not. Using CCSID 37, A-I
are x'C1'-x'C9' and J-M are x'D1'-x'D4', with SHY (soft hyphen), ô, ö, ò, ó
and õ thrown into the middle at x'CA'-x'CF' and } coming in at x'D0'. So
an EBCDIC range check of x'C1'-x'D4' would catch some non A-M characters
(and should). With Unicode testing of hex values, it would all depend on
exactly how the code works internally (presumably based on data type).
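
For anyone who wants to see the gap between I and J for themselves, here is a
small sketch in Python, assuming its "cp037" codec is an acceptable stand-in
for CCSID 37. The helper names (ebcdic_byte, chars_in_byte_range) are made up
for this illustration. It decodes every byte from the code for A through the
code for M and lists whatever is not a letter A-M.

```python
# Assumption: Python's "cp037" codec stands in for CCSID 37.
CODEC = "cp037"

def ebcdic_byte(ch):
    """Return the single EBCDIC byte value for one character (hypothetical helper)."""
    return ch.encode(CODEC)[0]

def chars_in_byte_range(lo, hi):
    """Decode every byte value from lo to hi inclusive (hypothetical helper)."""
    return [bytes([b]).decode(CODEC) for b in range(lo, hi + 1)]

LETTERS = "ABCDEFGHIJKLM"

# A byte-wise range check from the code for A to the code for M.
in_range = chars_in_byte_range(ebcdic_byte("A"), ebcdic_byte("M"))

# Characters caught by the byte range that are not letters A-M.
extras = [c for c in in_range if c not in LETTERS]

for c in extras:
    print(f"x'{ebcdic_byte(c):02X}'  U+{ord(c):04X}  {c!r}")
```

The byte range spans 20 values but only 13 of them are letters, so a byte-wise
check and a code-point check of "A to M" are not the same test.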

On Mon, Aug 16, 2021 at 3:36 AM Peter Dow <petercdow@xxxxxxxxx> wrote:

Wait, it just occurred to me. The documentation at

https://www.ibm.com/docs/en/i/7.2?topic=predicates-regexp-like-predicate#rbafzregexp_like__regexp_likecontrol

in "Table 4. Set Expressions (Character Classes)" lists

Example Description
[A-M] Range - match any character from A to M. The
characters to include are determined by Unicode code point ordering.
[\u0000-\U0010ffff] Range - match all characters.

and from https://chortle.ccsu.edu/FiniteAutomata/Section07/sect07_11.html,

"Rule 3. Ranges of Characters

To show a range of characters, use square brackets and separate the
starting character from the ending character with a hyphen. For example,
[0-9] matches any digit. Several ranges can be put inside square
brackets. For example, [A-CX-Z] matches 'A' or 'B' or 'C' or 'X' or 'Y'
or 'Z'."
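
A quick way to try the rule quoted above is with Python's re module, used
here only as a convenient stand-in and not as a claim about how Db2 for i
evaluates the same pattern. The test strings are invented for the example.

```python
import re

# Two ranges inside one pair of square brackets, as in the quoted rule.
pattern = re.compile(r"[A-CX-Z]")

# Check each candidate character on its own.
candidates = "ABCDWXYZ"
matched = [c for c in candidates if pattern.fullmatch(c)]
print(matched)

# [0-9] matches any single digit.
digits = re.findall(r"[0-9]", "a1b22c")
print(digits)
```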

But apparently REGEXP_INSTR is treating [\x00-\x3f] as a list of 3
bytes, x'00', x'60', and x'3f'.

What I've gathered after a lot of googling is that most regex
implementations deal with strings of characters, not bytes, and
therefore do not really support ranges of byte values.
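
One way to compare the byte reading with the character reading is to look at
the hyphen's numeric value under each. The sketch below is Python, assuming
its "cp037" codec corresponds to CCSID 37. It only compares numbers and says
nothing definite about what REGEXP_INSTR does internally.

```python
# Assumption: Python's "cp037" codec stands in for CCSID 37.
HYPHEN = "-"

ebcdic_value = HYPHEN.encode("cp037")[0]  # the byte stored in an EBCDIC column
unicode_value = ord(HYPHEN)               # the Unicode code point

# Is the hyphen inside 00-3F when read as an EBCDIC byte?
in_range_as_ebcdic_byte = 0x00 <= ebcdic_value <= 0x3F

# Is the hyphen inside 00-3F when read as a Unicode code point?
in_range_as_code_point = 0x00 <= unicode_value <= 0x3F

print(f"EBCDIC byte x'{ebcdic_value:02X}' in range: {in_range_as_ebcdic_byte}")
print(f"Unicode U+{unicode_value:04X} in range: {in_range_as_code_point}")
```

If the engine compares code points, as the IBM table quoted above says it
does for ranges, then the hyphen would fall inside the range on its own
merits, without being treated as a literal list member.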

On 8/15/2021 5:56 PM, John Yeung wrote:

On Sun, Aug 15, 2021 at 6:33 PM Peter Dow <petercdow@xxxxxxxxx> wrote:

values regexp_instr('abcdef-ghijk' || x'3f', '[\x00-\x3f]') returns 7,
which is what I expected.

values regexp_instr('- - - - - - - - - - -', '[\x00-\x3F]') returns 1,
which is NOT what I expected.
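
For comparison, here is how a regex engine that works on Unicode code points
handles the same two inputs. This is a sketch using Python's re module, not
Db2. The regexp_instr function below is a hypothetical helper that only
mimics the 1-based result of the SQL function, and the x'3f' byte is
converted with Python's "cp037" codec on the assumption that it matches
CCSID 37.

```python
import re

def regexp_instr(source, pattern):
    """Hypothetical helper: 1-based position of the first match, 0 if none."""
    m = re.search(pattern, source)
    return m.start() + 1 if m else 0

# The EBCDIC byte x'3f' converted to its Unicode character (assumes cp037).
sub = bytes([0x3F]).decode("cp037")

first = regexp_instr("abcdef-ghijk" + sub, r"[\x00-\x3f]")
second = regexp_instr("- - - - - - - - - - -", r"[\x00-\x3F]")

print(first, second)
```

Under code point ordering, the first character in either string that falls
in the range is the hyphen itself, which would give the same two positions
reported above.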

Why did you expect to find the hyphen in the first example but not in
the second?

John Y.
