Maximum CPU Time and Maximum Temporary Storage settings to protect against run-away jobs -- MIDRANGE-L

I would like to protect our system against run-away jobs.

I am considering three adjustments:

o QUERY_TIME_LIMIT in QAQQINI, which specifies a time limit for database
queries allowed to be started, based on the estimated number of elapsed
seconds. The current value can be read with:
SELECT QQVAL FROM cancom/QAQQINI WHERE QQPARM = 'QUERY_TIME_LIMIT'
o Maximum CPU time that a job can use
o Maximum temporary storage that a job can use

It seems that adjusting these settings downward to the point of causing
jobs to fail is fraught with risk, depending on which job goes into
message-wait status and ends.

Recommendations?

++++++++++++++++++++++++++++++++++++++++++++==

The Maximum CPU Time and Maximum Temporary Storage that a job can use
are defined in the class object named by the routing entry in the
subsystem description.

o When a job reaches the maximum processing unit time, it is ended with
message CPC1218.
o When a job reaches the maximum temporary storage allowed, it is ended
with message CPC1217.

In some cases, the jobs would be successful if allowed a bit more CPU
time or storage. The jobs should be held rather than ended and the
system operator should be notified.

PTF SI42845 for APAR SE45779 changes how IBM i manages jobs that exceed
their CPU or storage limits.

The class object defines the processing attributes for a job. The
routing entry in the subsystem description is used to determine which
class object is used when a job is initiated. Two of these processing
attributes within the class object are Maximum processing unit time
(CPUTIME) and Maximum temporary storage allowed (MAXTMPSTG), which both
have default values of *NOMAX. Prior to this recent PTF, if values were
entered for these parameters, the job would be ended if one of the
limits was hit. The cause text of each message (CPC1218, CPC1217) tells
you whether the job ended abnormally because the maximum CPU time was
consumed or because the maximum temporary storage limit was exceeded.

The system cannot know whether the job was near the completion of its
work at the moment it ended the job. It is possible that
given a little more CPU time or temporary storage, the job would be able
to run to completion. Because of the difficulty in predicting the upper
CPU or temporary storage limits required by a job, along with the fact
that the job would be ended when these limits were hit, many customers
simply left these values at their default setting.

The above PTF that was recently released changes the behavior so that
jobs are no longer ended when they have exceeded their maximum
processing unit time or their maximum temporary storage limit. Rather,
the jobs will be held. When a job is held by the system due to these
conditions, a message will be sent to the QSYSOPR message queue:

o CPI112D - Job held by the system, CPUTIME limit exceeded
o CPI112E - Job held by the system, MAXTMPSTG limit exceeded

This change allows the system operator to determine whether the jobs
should be ended or if they should be allowed to continue to run to
completion.

If you want the jobs to continue to run, you must first raise the limit
that was met and then use the Release Job (RLSJOB) command (you cannot
release a job that is above its limit). To allow these values to be
changed, the Change Job command and the Change Job APIs have been
enhanced.

The Change Job (CHGJOB) command has been enhanced with two new
parameters:

o Maximum CPU time (CPUTIME): The maximum CPU time parameter specifies
the maximum processing unit time (in milliseconds) that the job can use.
If the maximum time is exceeded, the job is held.

o Maximum temporary storage (MAXTMPSTG): The maximum temporary storage
parameter specifies the maximum amount of temporary auxiliary storage
(in megabytes) that the job can use. This temporary storage is used for
storage required by the program itself and by implicitly created
internal system objects used to support the job. (It does not include
storage for objects in the QTEMP library.) If the maximum temporary
storage is exceeded, the job is held.
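As a rough sketch of the raise-then-release sequence described above:
suppose a job has been held with CPI112D. The qualified job name
123456/QUSER/NIGHTLY and the new limit are hypothetical, chosen only for
illustration; substitute the job named in the QSYSOPR message and a
value that suits your workload.

```cl
/* Hypothetical job held by the system with CPI112D (CPUTIME exceeded). */
/* Raise the CPU limit first; the value is in milliseconds,             */
/* so 7200000 is 2 hours of CPU time (an arbitrary example value).      */
CHGJOB JOB(123456/QUSER/NIGHTLY) CPUTIME(7200000)

/* Only after the limit is above what the job has used can it be        */
/* released to continue running.                                        */
RLSJOB JOB(123456/QUSER/NIGHTLY)
```

For a job held with CPI112E, the same pattern would apply with
MAXTMPSTG (in megabytes) in place of CPUTIME.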

The Change Job (QWTCHGJB) API has been enhanced to support two new keys
on the JOBC0100 and JOBC0200 formats:

o Maximum processing unit time allowed, in milliseconds (1302)
o Maximum temporary storage allowed, in megabytes (1305)

This PTF makes it easier to protect your system from the effects of a
run-away job that either consumes more CPU than expected or uses more
temporary storage than expected. By setting these limits larger than
what any job should use, you can protect the system from the potentially
negative effects of a run-away job. Because the job will be held rather
than ended, the limits do not need to be set perfectly. If either limit
is hit, you can increase the limit with the Change Job command or API,
and then release the job to allow it to continue to run. If the new
upper limit is met, the system will once again hold the job.

With the change introduced with this PTF, you should start to move away
from the default *NOMAX values and set appropriate limits. Particularly
with the temporary storage limit, you can prevent a system outage by
setting an upper limit on the class object for the maximum temporary
storage that a job can use (be sure to keep that limit lower than the
amount of storage available on the system). With the new behavior of the
job being held when the limit is hit, you have the capability to assess
and determine the best action for the job.
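A minimal sketch of setting limits on a class object follows. It assumes
your batch routing entry points at class QGPL/QBATCH, which is an
assumption to verify against your own subsystem description, and the
limit values are arbitrary examples rather than recommendations. Also
check the units on the MAXTMPSTG prompt for your release, since the
class commands may not use the same unit on every release.

```cl
/* Review the current values (shipped default is *NOMAX for both).     */
DSPCLS CLS(QGPL/QBATCH)

/* Example values only: 1 hour of CPU time (3600000 milliseconds) and  */
/* a temporary storage cap of 20480, assuming the unit is megabytes.   */
/* QGPL/QBATCH is assumed to be the class named by the routing entry.  */
CHGCLS CLS(QGPL/QBATCH) CPUTIME(3600000) MAXTMPSTG(20480)
```

The new limits should apply to jobs that start after the change; a job
that is already running can be adjusted individually with CHGJOB.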