Re: Jobs that won't end -- MIDRANGE-L

On 16 Nov 2012 09:57, Needles,Stephen J wrote:

We have begun to experience a higher than normal rate of jobs that
fail to end.

Here is the scenario...

We have a scheduled job that will spin through the active pre-start
jobs (in our case: QRWTSRVR jobs that service our web consumers) with
OPTION(*CNTRLD). After a time, the OS will usually turn around and
end them *IMMED...usually.

What is the specific intention for doing so; i.e. what is ending the job(s) intended to achieve?

In some cases, for an unknown reason, the transition from *CNTRLD to
*IMMED does not seem to occur. When this happens, the only way we've
found to kill them off is to do an ENDJOBABN.

If ENDJOBABN could be issued without a prior ENDJOB OPTION(*IMMED), then the system had already effected the immediate end when the Controlled-End Time Limit expired.
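
To make that inference concrete, here is a toy Python model of the end sequence. It is only a sketch of the reasoning, not anything the OS exposes: the names EndState, end_state and endjobabn_allowed are hypothetical, and the 600-second figure reflects the documented rule, as I recall it, that ENDJOBABN is rejected until an immediate end has been in progress for 10 minutes, so treat it as an assumption.

```python
from enum import Enum


class EndState(Enum):
    ACTIVE = "active"
    ENDING_CNTRLD = "ending (controlled)"
    ENDING_IMMED = "ending (immediate)"


# Assumption: ENDJOBABN is only accepted once an immediate end has been
# in progress for 10 minutes.
ENDJOBABN_WAIT_SECONDS = 600


def end_state(seconds_since_endjob, delay_seconds):
    """State of a job after ENDJOB OPTION(*CNTRLD) DELAY(delay_seconds).

    seconds_since_endjob is None when no ENDJOB has been issued.
    """
    if seconds_since_endjob is None:
        return EndState.ACTIVE
    if seconds_since_endjob < delay_seconds:
        return EndState.ENDING_CNTRLD
    # The delay time has expired, so the system escalates to an immediate end.
    return EndState.ENDING_IMMED


def endjobabn_allowed(seconds_since_endjob, delay_seconds):
    """True only if the immediate end has been running long enough."""
    if end_state(seconds_since_endjob, delay_seconds) is not EndState.ENDING_IMMED:
        return False
    return seconds_since_endjob - delay_seconds >= ENDJOBABN_WAIT_SECONDS
```

With a 30-second delay, endjobabn_allowed(100, 30) is False and endjobabn_allowed(630, 30) is True. In this model, ENDJOBABN being accepted at all implies the job had already reached the immediate-end state.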

The program stack shows that they are in the process of some DB
cleanup.

If indeed the *IMMED end never started, then the controlled end probably never finished because that indefinitely long coded loop of /termination close/ processing prevented the terminate-immediate event from being processed.

The stack includes: QSQXIT (procedures: CLOSE_IT and QSQXIT), QSQBAS
(SQHRDCLS), QDMCLOSE (statement /0113), QDBCLOSE (statement /02B9),
QQQQEXIT (procedure QQQQEXIT and TIDYUP).

Super major *Ouch* situation BTW, especially if the potential corruption issues for that type of failure have never been addressed beyond a minor sanity check added to Data Management Close to ensure the close addresses something that "looks like" what is expected as input. You will really want to involve IBM. I do not recall what the origin(s) of the problem were [¿OS code mismatch?], only that the overly brutish and blind attempt by the Query Close processing was probably the worst excuse for OS code I ever saw :-( Inexcusable. The reason the loop is so long is that there is absolutely no sanity check of the number-of-entries variable, and IIRC the value comes from based storage, so any mess with the basing pointer could leave that count, and IIRC even its array elements, in effectively random PASA.
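
For anyone who has not seen this class of bug, a small Python sketch of the idea follows. It is purely illustrative and is not the OS code: the storage layout (a 4-byte count followed by 8-byte entries), the function names, and the use of a byte offset to stand in for the basing pointer are all assumptions of mine.

```python
import struct

# Hypothetical layout: a 4-byte big-endian entry count, then 8-byte entries.
HEADER = struct.Struct(">I")
ENTRY = struct.Struct(">Q")


def build_table(handles):
    """Pack a count header followed by one entry per handle."""
    return HEADER.pack(len(handles)) + b"".join(ENTRY.pack(h) for h in handles)


def read_count_unchecked(storage, base):
    """Trust whatever count happens to sit at `base` (no sanity check)."""
    (count,) = HEADER.unpack_from(storage, base)
    return count


def read_count_checked(storage, base):
    """Reject a count that cannot fit in the storage behind `base`."""
    if base < 0 or base + HEADER.size > len(storage):
        raise ValueError("base offset outside storage")
    (count,) = HEADER.unpack_from(storage, base)
    max_entries = (len(storage) - base - HEADER.size) // ENTRY.size
    if count > max_entries:
        raise ValueError(f"implausible entry count {count} (max {max_entries})")
    return count
```

Given storage = build_table([0xFFFFFFFF00000001, 2, 3]), reading at the correct base 0 yields a count of 3 either way. Reading at a mis-set base of 4 makes the unchecked version return 4294967295, which is the sort of value that turns a close loop into an effectively endless one, while the checked version raises ValueError instead.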

My questions...

Has anyone else experienced this?

If so, were you able to discover why (in our case, the number of
incidents increased since last weekend's build install, suggesting we
somehow contributed to the problem)?

Is there a cure?

I can find nothing that seems to address this in the archives.

Call IBM or your service provider [who should call IBM] ASAP!

