Re: System Hung -- MIDRANGE-L

On 24 Apr 2012 05:13, Jerry C. Adams wrote:

> I was in the process of looking through the job logs and history
> yesterday when I had to leave (they kick me out at 1430!). But, as
> you surmised, the ones I have looked at reveal nothing; even the one
> that applied to my sessions.

It has been too long since I last looked at such an incident for me to recall with certainty, but I think the history will show effective "job ended" messages for the jobs that were active when the system went down. However, those messages will appear in the history _after_ the system started the IPL; i.e. after the starts of some of the system jobs are recorded in the history. I believe the timestamps of the joblogs will also be from during the IPL; I think the start of QSPLMAINT is what precedes the spooling of /incomplete jobs/ from prior to the system crash. I would also do a WRKSPLF (*ALL *ALL *ALL *ALL), generally with OUTPUT(*PRINT), to review for any spooled files [esp. many dump spools] from just before the apparent hang; e.g. a looping job may have produced a job dump for each iteration, or maybe there was just one error and then the job's attempt to recover took a dive.
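
A rough sketch of that review, as it might be entered on a command line. The DSPLOG request is an assumption added here for illustration and is not named above; the times and dates in PERIOD are hypothetical placeholders, entered in the job's date format [assumed here to be MDY], and would need to be changed to bracket the apparent hang and the IPL.

```cl
/* Print the list of all spooled files, to scan for dump spools      */
WRKSPLF SELECT(*ALL *ALL *ALL *ALL) OUTPUT(*PRINT)

/* Print the history log for a window spanning the hang and the IPL  */
/* (hypothetical times and dates; adjust to the actual incident)     */
DSPLOG LOG(QHST) PERIOD((130000 042312) (150000 042312)) OUTPUT(*PRINT)
```

Comparing the two listings side by side should show whether the "job ended" entries and the joblog spool times fall after the first system job starts recorded for the IPL.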

> There was about an hour gap, which stopped when I started the inquiry
> of the error message I was investigating and started up when
> everything started shutting down.

A gap from before both the F1\F10 when the system appeared to hang and before the forced IPL? The VLogs from before the forced IPL would likely be of interest; as I recall the descriptive text sometimes makes the error [versus just diagnostic] logging somewhat conspicuous, and the really bad ones often include a process dump, which includes the full job name. For that job, WRKJOB JOB(name_from_dump) OPTION(*SPLF) might reveal QPSRVDMP, QPDSPJOB, etc.
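
For illustration only, the request against a job named in a process dump might look like the following. The qualified job name 123456/QUSER/SOMEJOB is entirely hypothetical; the real number/user/name would be copied from the dump.

```cl
/* List the spooled files of the job identified in the process dump  */
/* (123456/QUSER/SOMEJOB is a hypothetical placeholder)              */
WRKJOB JOB(123456/QUSER/SOMEJOB) OPTION(*SPLF)
```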

There is also WRKPRB data, which if retained might have some diagnostic data for an issue that was identified before the system had its difficulties.

FWiW, being v5r1, I would keep an eye out for indications of restore and\or index rebuild, or access path invalidation. The QDBSRV01 job by default runs at a better priority than the console IIRC, and a problem in that job manifesting as a loop would likely appear similar to what was described. The EDTRCYAP settings may be worth a review for possible adjustment, and the EDTRBDAP screen should have appeared on the manual IPL; though if the event handling job were looping instead of properly processing, no AccPth would necessarily ever be added to the list.
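
A minimal sketch of that access path review follows. DSPRCYAP is an assumption added here as a read-only way to capture the current settings before changing anything; the EDT commands are interactive and require suitable authority.

```cl
/* Print the current access path recovery times for later reference  */
DSPRCYAP OUTPUT(*PRINT)

/* Review, and optionally adjust, the target recovery times          */
EDTRCYAP

/* Show any access paths awaiting or undergoing rebuild              */
EDTRBDAP
```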

> By the way, I know that, when the system registers an abnormal end
> (QABNORMSW = 1), this means that two IPLs will be necessary when
> applying PTFs. But at V5R1 (sigh) I don't think that's going to
> happen. Is there any other downside to leaving the system in this
> state, or should I re-IPL to get it back to normal?

The QABNORMSW is just historical, now that the IPL is done. Any impact the switch had was during that IPL; e.g. if PTFs had been scheduled to apply, but perhaps were not, due to any caution\concern by the PTF handler [PZ component]. Any code that reviews the status should understand that the system is no longer IPLing [the only time the value is truly pertinent], and treat the value as merely historical. The spooled SCPF joblog [not the currently active SCPF joblog], produced at the end of the IPL, would record any PTF activity. WRKSPLF (QSYS *ALL *ALL SCPF) is how I would locate the spooled QPJOBLOG from the most recent IPL.
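
As a short sketch, the two checks could be entered as shown below. The DSPSYSVAL request is added here only to show how the switch can be viewed; the WRKSPLF request is the one described above, written with its SELECT keyword.

```cl
/* Display the previous end of system indicator:                     */
/* '0' = normal end, '1' = abnormal end                              */
DSPSYSVAL SYSVAL(QABNORMSW)

/* List spooled files owned by QSYS with user data SCPF, to locate   */
/* the QPJOBLOG spooled at the end of the most recent IPL            */
WRKSPLF SELECT(QSYS *ALL *ALL SCPF)
```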

Regards, Chuck