Re: regexec - chopping a large text string into chunks -- RPG400-L

What is your pattern intended to do? I know you want to split the line up
every 40 characters, but... why not use %subst() to do that? I assume you
want to do something more, like split it up on a whitespace boundary, right?

Thanks for calling that out, Scott. After re-reading my post I realize I
left out key information. Making this code work is actually an attempt to
keep all the processing in RPG rather than moving it to Java (which my
customer is thinking of doing if I can't get this working).

The key criteria for what I need the regular expression to accomplish would
be:

1) I want to split a long text field into 40-byte records.

2) I don't want splits to happen in the middle of a word and if the 40th
byte is in the middle of a word it should go back to the previous space.

3) If a carriage return is found, that should also break the string to have
everything after the carriage return start on a new line.
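
The three criteria above can be sketched without any regular expression at all. The following is only a minimal Java illustration, not the code from either linked program: the class name WrapSketch and the method wrap are made up for this example, and it assumes a single-byte character set so that 40 characters equals 40 bytes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not from the original post: wraps text into
// lines of at most `width` characters, following the three criteria.
class WrapSketch {
    static List<String> wrap(String text, int width) {
        List<String> out = new ArrayList<>();
        // Criterion 3: a carriage return (or CRLF / LF) always starts a new line.
        for (String para : text.split("\r\n|\r|\n", -1)) {
            String rest = para.trim();
            // Criterion 1: keep cutting while the remainder exceeds the width.
            while (rest.length() > width) {
                // Criterion 2: back up to the last space at or before the limit.
                int cut = rest.lastIndexOf(' ', width);
                if (cut <= 0) {
                    cut = width; // no space found: hard-split an over-long word
                }
                out.add(rest.substring(0, cut).trim());
                rest = rest.substring(cut).trim();
            }
            out.add(rest);
        }
        return out;
    }

    public static void main(String[] args) {
        String sample = "This is a line of text that runs past forty bytes.\rNext line";
        for (String line : wrap(sample, 40)) {
            System.out.println("[" + line + "]");
        }
    }
}
```

The same loop could presumably be written in RPG with %scan or %check-style built-ins, but that translation is not attempted here.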

Concerning the un-modern coding, I rarely use occurs so I didn't want to
change that line of code for fear of breaking something that would send me
on a wild goose chase to fix.

I should also note that this same regular expression works from the Java
environment. After digging around in the archives though, I see a number of
people have found the IBMi regex notation might not be the same as on other
platforms (e.g. \w doesn't seem to work per others in the archives, though
I haven't tried it myself).

so it'll take the 40 characters rather than the one) followed by either
the end of the line ($) or the "s" character.

I believe the \s is short-hand for whitespace (space, \t, \r, \n) per this page:
http://www.regular-expressions.info/examples.html (look at the Trimming
Whitespace section). So maybe the short-hand doesn't work on the IBMi?
Here is the Java code that is working as expected on my PC:
http://code.midrange.com/db7764d7d5.html

That Java code produces the below output:
[[Begin]]
Compiled pattern:(\S\S{40,}|.{1,40})(\s+|$)
This is a line of text 111111. T2his i2s
al2so a2 lin2ewill eventuallyrunoverats
string is longer than normal. Somwersome
more text

over at some point simple stri.This is
the last sentence in the paragraph.
[[End]]
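
For readers who cannot open the link, here is a minimal sketch of how a pattern like the compiled one above is typically driven in Java. It is not the linked program: the class name RegexChunkSketch and the method chunks are invented for this illustration, and only the pattern text is taken from the output above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: not the program behind the link above.
class RegexChunkSketch {
    // Same pattern text as the "Compiled pattern" line in the output.
    static final Pattern CHUNK =
        Pattern.compile("(\\S\\S{40,}|.{1,40})(\\s+|$)");

    static List<String> chunks(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = CHUNK.matcher(text);
        // Each find() resumes where the previous match ended.
        while (m.find()) {
            // Group 1 is the chunk; group 2 is the whitespace (or end of
            // input) that terminated it and is dropped.
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String sample = "This is a line of text that runs past forty bytes.\rNext line";
        for (String line : chunks(sample)) {
            System.out.println("[" + line + "]");
        }
    }
}
```

Two things do the real work here. The greedy .{1,40} backtracks until the next character is whitespace or end of input, which is what keeps splits out of the middle of a word. And in Java "." does not match a carriage return by default, which is what forces a break there. Other regex engines may treat "." differently.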

The modified RPG code (i.e. %occur and option(*string) added) has the
following output for three MODS entries:
[[Begin]]
This is a line of text 111111. T2his i2s
This is a line of text 111111. T2his i2
s
[[End]]

Modified RPG code is here: http://code.midrange.com/492f7222d1.html

So given your evaluation of the regex results I am not sure which way to go
here because it works in Java but not on the IBMi. That tells me there are
some notation discrepancies between the two. So I am thinking I should
change the short hand stuff to be "long hand". Here is what I *believe* to
be the long hand version:

([a-zA-Z0-9]{40,}|.{1,40})(^[a-zA-Z0-9]+|$)

I replaced \S (note capital S is negating whitespace) with [a-zA-Z0-9]. Note
I couldn't do two of those sequences of negated bracketed expressions
because the regex compiler didn't like the second one.

I replaced \s (lower case s) with [a-zA-Z0-9].

Doing that and re-running my RPG program gave me the following results (note
the blank last line is literal).
[[Begin]]
s is the last sentence in the paragraph.
s is the last sentence in the paragraph.

[[End]]
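
For comparison, \S and \s can also be spelled out as bracket expressions that list the whitespace characters themselves, rather than the alphanumerics. The Java sketch below only checks that such a long-hand pattern yields the same chunks as the short-hand one in Java. Whether the IBMi regcomp() accepts it in the same way is an assumption that is not verified here, and the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: names are hypothetical, not from the original post.
class LonghandSketch {
    // Short-hand form, as in the "Compiled pattern" line of the Java output.
    static final Pattern SHORTHAND =
        Pattern.compile("(\\S\\S{40,}|.{1,40})(\\s+|$)");

    // Long-hand form: the single-backslash Java escapes \t, \r, \n become
    // real tab, CR and LF characters inside the brackets, so the regex
    // engine sees no backslash short-hand at all. Assumption: space, tab,
    // CR and LF are the only whitespace in the data (Java's \s also covers
    // form feed and vertical tab).
    static final Pattern LONGHAND =
        Pattern.compile("([^ \t\r\n]{41,}|.{1,40})([ \t\r\n]+|$)");

    static List<String> chunks(Pattern p, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            out.add(m.group(1)); // keep the chunk, drop the separator
        }
        return out;
    }

    public static void main(String[] args) {
        String sample = "This is a line of text that runs past forty bytes.\rNext line";
        System.out.println(chunks(SHORTHAND, sample).equals(chunks(LONGHAND, sample)));
    }
}
```

The difference from the attempt above is that the negation sits inside the brackets ([^ ...]), where it excludes the listed characters. A ^ placed in front of a bracket expression is instead read as a start-of-line anchor.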

In the end I am after what I described in the three points at the top of
this email, and any assistance is greatly appreciated as I am not well versed
in regex outside of the simple ones I find in XSDs.

TIA,
Aaron Bartell
http://mowyourlawn.com