RE: Maximum CPU Time and Maximum Temporary Storage settings to protect against run-away jobs -- MIDRANGE-L

I had this problem once with users who were prone to writing bad queries. We set the MAXSTG parameter on all of their user profiles. That way they could still run valid long-running queries, but a job would blow up if a temporary file or something else got out of hand (which is usually the result of a run-away query). Probably not a perfect solution, but it certainly saved me a bunch of weekend phone calls.

***********************************
Bradford Lovelady

Operating Systems Engineer
Technology Infrastructure Services

Wells Fargo Bank l 200 Wildwood Pkwy l Birmingham, AL 35209
MAC W2691-010
Tel 205-938-1999 l Cell 205-826-2834

brad.lovelady@xxxxxxxxxxxxxx

-----Original Message-----
From: midrange-l-bounces@xxxxxxxxxxxx [mailto:midrange-l-bounces@xxxxxxxxxxxx] On Behalf Of Eric Lehti
Sent: Friday, February 08, 2013 3:02 PM
To: Midrange Systems Technical Discussion
Subject: Maximum CPU Time and Maximum Temporary Storage settings to protect against run-away jobs

I would like to protect our system against run-away jobs.

Adjust QAQQINI: SELECT QQVAL FROM cancom/QAQQINI WHERE QQPARM = 'QUERY_TIME_LIMIT'

Specifies a time limit for database queries allowed to be started based on the estimated number of elapsed seconds

Adjustment on Maximum CPU time that a job can use

Adjustment on Maximum Temporary Storage that a job can use

It seems that adjusting these settings downward to the point of causing jobs to fail is fraught with risk, depending on which job goes into message-wait status and ends.

Recommendations?

++++++++++++++++++++++++++++++++++++++++++++==

The Maximum CPU Time and Maximum Temporary Storage that a job can use are defined in the class object named by the routing entry in the subsystem description.

o When a job reaches the maximum processing unit time, it is ended with message CPC1218.
o When a job reaches the maximum temporary storage allowed, it is ended with message CPC1217.

In some cases, the jobs would be successful if allowed a bit more CPU time or storage. The jobs should be held rather than ended and the system operator should be notified.

PTF SI42845 for APAR SE45779 changes how IBM i manages jobs that exceed their CPU or storage limits.

The class object defines the processing attributes for a job. The routing entry in the subsystem description is used to determine which class object is used when a job is initiated. Two of these processing attributes within the class object are Maximum processing unit time (CPUTIME) and Maximum temporary storage allowed (MAXTMPSTG), which both have default values of *NOMAX. Prior to this recent PTF, if values were entered for these parameters, the job would be ended if one of the limits was hit. The cause for each of these messages (CPC1218, CPC1217) tells you whether the job ended abnormally due to the maximum CPU time being consumed or the maximum temporary storage limit being exceeded.

The system cannot know whether a job was near the completion of its work at the moment a limit ended it. It is possible that, given a little more CPU time or temporary storage, the job would have been able to run to completion. Because of the difficulty in predicting the upper CPU or temporary storage limits required by a job, along with the fact that the job would be ended when these limits were hit, many customers simply left these values at their default setting.

The above PTF that was recently released changes the behavior so that jobs are no longer ended when they have exceeded their maximum processing unit time or their maximum temporary storage limit. Rather, the jobs will be held. When a job is held by the system due to these conditions, a message will be sent to the QSYSOPR message queue:

o CPI112D - Job held by the system, CPUTIME limit exceeded
o CPI112E - Job held by the system, MAXTMPSTG limit exceeded

This change allows the system operator to determine whether the jobs should be ended or if they should be allowed to continue to run to completion.

If you want the jobs to continue to run, you must change the limit that was met and then use the Release Job (RLSJOB) command (you cannot release a job that is above the limit). To allow these values to be changed, the Change Job command and the Change Job APIs have been enhanced.

The Change Job (CHGJOB) command has been enhanced with two new parameters:

o Maximum CPU time (CPUTIME): The maximum CPU time parameter specifies the maximum processing unit time (in milliseconds) that the job can use. If the maximum time is exceeded, the job is held.

o Maximum temporary storage (MAXTMPSTG): The maximum temporary storage parameter specifies the maximum amount of temporary auxiliary storage (in megabytes) that the job can use. This temporary storage is used for storage required by the program itself and by implicitly created internal system objects used to support the job. (It does not include storage for objects in the QTEMP library.) If the maximum temporary storage is exceeded, the job is held.
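
As a rough illustration of the change-then-release sequence described above, the Python sketch below only builds the two CL command strings an operator might run against a held job. The helper names (qualified_job, raise_limit_and_release) and the sample job name are made up for this example. The JOB keyword and the number/user/name job form are the usual CL conventions, and the CPUTIME and MAXTMPSTG keywords are the ones named in the note. Check the command prompts on your own release before relying on it.

```python
from typing import List, Optional


def qualified_job(number: str, user: str, name: str) -> str:
    """Return the number/user/name form used to identify a job."""
    return f"{number}/{user}/{name}"


def raise_limit_and_release(job: str,
                            cputime_ms: Optional[int] = None,
                            maxtmpstg_mb: Optional[int] = None) -> List[str]:
    """Build CHGJOB and RLSJOB command strings for a held job.

    Hypothetical helper: it only formats text and runs nothing.
    """
    if cputime_ms is None and maxtmpstg_mb is None:
        # A job still above its limit cannot be released, so insist on a change.
        raise ValueError("change at least one limit before releasing the job")
    parms = []
    if cputime_ms is not None:
        parms.append(f"CPUTIME({cputime_ms})")      # milliseconds, per the note
    if maxtmpstg_mb is not None:
        parms.append(f"MAXTMPSTG({maxtmpstg_mb})")  # megabytes, per the note
    return [f"CHGJOB JOB({job}) " + " ".join(parms), f"RLSJOB JOB({job})"]


if __name__ == "__main__":
    job = qualified_job("123456", "QUSER", "NIGHTLY")  # made-up job
    for command in raise_limit_and_release(job, cputime_ms=600000):
        print(command)
```

The limit change comes first in the returned list because, as noted above, a job that is still over its limit cannot be released.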

The Change Job (QWTCHGJB) API has been enhanced to support two new keys on the JOBC0100 and JOBC0200 formats:

o Maximum processing unit time allowed, in milliseconds (1302)
o Maximum temporary storage allowed, in megabytes (1305)

This PTF makes it easier to protect your system from the effects of a run-away job that either consumes more CPU than expected or uses more temporary storage than expected. By setting these limits larger than what any job should use, you can protect the system from the potentially negative effects of a run-away job. Because the job will be held rather than ended, the limits do not need to be set perfectly. If either limit is hit, you can increase the limit with the Change Job command or API, and then release the job to allow it to continue to run. If the new upper limit is met, the system will once again hold the job.
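
To see why the limits "do not need to be set perfectly", here is a toy Python model of the hold, raise, release cycle. It is not an IBM interface. The function name, the doubling factor, and the cap on how many times an operator is willing to raise the limit are all assumptions chosen for illustration.

```python
from typing import Tuple


def run_with_holds(required_ms: int, limit_ms: int,
                   raise_factor: int = 2,
                   max_raises: int = 3) -> Tuple[bool, int, int]:
    """Model a job that is held each time it hits its CPU limit.

    Returns (completed, times_held, final_limit_ms). Toy model only.
    """
    holds = 0
    while required_ms > limit_ms:
        holds += 1                   # system holds the job and notifies QSYSOPR
        if holds > max_raises:
            # Operator decides this is a run-away job and ends it instead.
            return False, holds, limit_ms
        limit_ms *= raise_factor     # operator raises the limit, then releases
    return True, holds, limit_ms


if __name__ == "__main__":
    # A legitimate job that needs 5x the initial guess still completes.
    print(run_with_holds(required_ms=500, limit_ms=100))
    # A run-away job is stopped once the operator gives up raising the limit.
    print(run_with_holds(required_ms=10**9, limit_ms=100))
```

In this model a legitimate job that needs more than the first guess costs a few operator interventions and still finishes, while a job that never stops asking for more is caught at whatever point the operator declines to raise the limit again.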

With the change introduced with this PTF, you should start to move away from the default *NOMAX values and set appropriate limits. Particularly with the temporary storage limit, you can prevent a system outage by setting an upper limit on the class object for the maximum temporary storage that a job can use (be sure to keep that limit lower than the amount of storage available on the system). With the new behavior of the job being held when the limit is hit, you have the capability to assess and determine the best action for the job.
