Re: Source Search -- WDSCI-L

I also expected REGEXP_COUNT to be slower than the API, but I did not expect that it would be that much slower.

Thomas.

-----Original Message-----
From: WDSCI-L [mailto:wdsci-l-bounces@xxxxxxxxxxxxxxxxxx] On Behalf Of Mark Murphy
Sent: Monday, December 2, 2019 21:07
To: Rational Developer for IBM i / Websphere Development Studio Client for System i & iSeries
Subject: Re: [WDSCI-L] Source Search

REGEXP_COUNT uses International Components for Unicode, but if you use the
C API for that rather than going through the database, I would expect that
to be much faster.

On Sun, Nov 3, 2019 at 11:31 AM Tools/400 <thomas.raddatz@xxxxxxxxxxx>
wrote:

FYI: I cannot get regcomp() and regexec() to work with character
classes such as "\s". I tried various things without success. Using
REGEXP_COUNT works like a charm but is incredibly slow (200 times slower
than regexec()).

Therefore I posted the problem at the rpg400-l mailing list hoping to
get help there: "Regular expression (regcomp()) ccsid issue".

Thomas.

On 2 Nov 2019 at 11:26, Tools/400 wrote:

Craig,

Interesting stuff. Thank you for letting us know.

Because of the "\s" issue, I assume that it is a CCSID problem. That is
what needs to be debugged. I hope that I can do that today or tomorrow.

Regards,

Thomas.

On 1 Nov 2019 at 17:48, Craig Richards wrote:

A slightly more efficient version might be

dcl-f(?>\s+)filea

or in your case
dcl-f(?> +)filea
or
dcl-f(?>[ ]+)filea

(I'm very surprised the \s suggested by David did not work - that's pretty
standard stuff).

Essentially this is wrapping the one-or-more whitespaces \s+ with (?>)
which is called Atomic Grouping.

The \s+ is greedy, which is to say it will grab as many whitespace
characters as it can and then look at the next part of the expression to
carry on matching (in your case the filea). If that fails to match, it will
backtrack, so if it grabbed 3 spaces, it will drop one and then try to
match again, and so on until it can't backtrack anymore.

The atomic grouping stops that backtracking process - essentially once it
gets past the closing parenthesis, it throws away all states so it doesn't
go back and try with, say, 2 spaces then one space.
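
To make the difference concrete, here is a rough sketch in Python. It is
not a real regex engine and not what regexec() or ICU do internally; it is
a hand-written matcher for the single shape "prefix, one or more
whitespace characters, suffix". The function name match_at and the
"attempts" counter are made up for illustration. It only counts how often
the suffix gets tried with and without backtracking.

```python
def match_at(text, start, prefix, suffix, atomic):
    """Match prefix + one-or-more whitespace + suffix at text[start:].

    Hypothetical helper for illustration only.
    Returns (matched, attempts), where attempts is the number of times
    the suffix was compared against the text.
    """
    if not text.startswith(prefix, start):
        return False, 0

    pos = start + len(prefix)

    # Greedy step: grab as many whitespace characters as possible.
    end = pos
    while end < len(text) and text[end].isspace():
        end += 1
    if end == pos:
        return False, 0  # "+" needs at least one whitespace character

    attempts = 0
    while end > pos:
        attempts += 1
        if text.startswith(suffix, end):
            return True, attempts
        if atomic:
            break        # atomic group: saved states are gone, give up
        end -= 1         # plain greedy: give back one character, retry
    return False, attempts


line = "dcl-f   fileb"   # three spaces, but the wrong file name
print(match_at(line, 0, "dcl-f", "filea", atomic=False))  # (False, 3)
print(match_at(line, 0, "dcl-f", "filea", atomic=True))   # (False, 1)
```

Both variants give the same answer here; the atomic one just fails after a
single attempt. A successful match such as "dcl-f   filea" takes one
attempt either way, so the saving only shows up on lines that do not match.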

Maybe not an issue for you, and maybe not supported if not even \s is
supported, but it's a good performance thing to be aware of for the
situations where it's obvious that once you've done a greedy match and the
next bit has failed, there is no point in dropping the last character of
the greedy match and retrying the expression again.

regards,
Craig
