Re: Failed GO SAVE 21 -- MIDRANGE-L

On 24-Sep-2015 08:47 -0600, John McKee wrote:

Yesterday was the fourth week in a row that Go SAVE 21 failed. But,
very differently.

At a minimum, the symptoms from the joblog should be offered. The term "failed" is so nebulous that it can elicit no worthwhile response, only requests for more information to clarify the failure.

Following the misadventure where a tape got stuck, I ended up (since
time was already allocated) doing an IPL. That lowered max unprotect
from 20G to around 5G yesterday. Combine that with save/delete of a
number of libraries. No idea if there is any relation.

So the storage required for the *NONSYS portion was somewhat reduced prior to this save. Could that have eliminated the need for the tape-change [volume-change] condition?

But apparently no other preventive actions were attempted prior to that save; i.e., was no attempt made prior to the save to do any of the following?
1> mark the file named flightrec [in whatever directory that resides] as AlwSav=No
2> disable\remove BRMS
3> disable\remove MSE
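
For the first item, a minimal sketch in CL, assuming flightrec is a stream file in the IFS. The path shown is only a placeholder, not the actual location of that file, which would need to be located first:

```
/* Placeholder path: substitute the directory in which flightrec actually resides */
CHGATR OBJ('/path/to/flightrec') ATR(*ALWSAV) VALUE(*NO)
```

The attribute can be restored later by repeating the request with VALUE(*YES).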

So, yesterday, I saw the SAV portion move along. Then, it froze for
a long time. Initially, I thought perhaps just a really big IFS
file. Maybe, but no idea. The progress line disappeared at some
point and then reappeared. That was followed by messages that
spooled output had been created for... a number of times. I think 21
spool files. Disk usage dropped as a result of IPL and removal of
libraries to the point that I thought it was low enough (from before)
that second tape would not be needed. There was NO message
requesting a tape change. Just failed backup, and device left FAILED
and damaged.

Without any preventive action taken prior to that save, the same seize contention could occur for the flight recorder, but in a different code path; the notable wait could have been the effect of a seize-wait [similar to how a lock-wait manifests].

While the request to log to the [BRMS] flight recorder was visible from the tape-change in the prior failures, perhaps this time the [BRMS] flight recorder logging is still occurring concurrent with the dump\save of the descriptor with the flightrec file, but from a different code path. And this time, the failing code path apparently has First Failure Data Capture (FFDC) active, for which the effect is:

But, this is different. This time, a problem was logged. The symptom
string is 5722 F/QTADMPDV.

That is a symptom generated as a result of FFDC in response to an unexpected failure. It manifests as a Problem Record visible in Work with Problems (WRKPRB), and is produced only if the Software Error Logging (QSFWERRLOG) system value says to do so.
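
To check whether that logging is enabled, and to see the resulting entries, something like the following should suffice; this is from memory, without access to a system on which to verify:

```
DSPSYSVAL SYSVAL(QSFWERRLOG)   /* *LOG means software errors are logged; *NOLOG means not */
WRKPRB                         /* Lists the problem records, including any created by FFDC */
```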

I did attempt to Google this. But, I do not understand what I
received, as it appears to be related to MLB installation.

The generated symptom-string is both sparse and generic, so inferences drawn from other issues with the same symptom are unlikely to be telling.

The actual data would need to be reviewed to interpret the meaning of the failure data being logged. IIRC, QTADMPDV is the program [TA=Tape, DMP=Dump, DV=Device] of the Tape feature that dumps the device, including the Tape Flight Recorder information for that device [and the corresponding tape MLB]; typically that is something the user is asked to call, but the FFDC may invoke that feature in response to whatever failure the FFDC was logging. The generated symptom may merely indicate that a non-specific failure was logged along with dump-tape info, rather than a failure of the dump-tape feature itself.

Just looked at PRTERRLOG. I decided to no longer use the "standard"
of WEEKLY for the volume id. The tape I used had one temporary write
error, with 138457 M Bytes written. This was the first time this
volume id had ever been used.

My guess, that is probably immaterial with regard to the failure.

To summarize: four fails, only last one logged a problem, last
failure also did NOT issue a tape change request, and there were
entries in LIC LOG much as before. The LIC LOG had no activity posted
prior to the failure, which is also different. The entries, to my
uneducated mind, appear to be same type as before.

My guess is that this failure was the same issue, manifest from a different place. Only by review of the spooled data and the VLog data could that be inferred with any confidence. As with the data collected from the prior failures, I would be willing to review the LIC data and the aforementioned spooled data produced within the job, to try to infer what transpired and then to describe that failure in terms more easily digested.

Question: Is this one of *those* issues that violate that stupid
dogma "If it ain't broke...", and was there some PTF already
created?

It could be that a lack of maintenance allows the failure; i.e., a preventive PTF may exist but is not applied. But again, personally, I would ensure I had a good save of the system [even if not a GO SAVE opt-21; something from which the system could be reloaded if necessary] before applying maintenance. As I recall, the directions for applying maintenance suggest a save before and one after [even if that almost never happens]; on a system that has effectively no support except on a pay-as-you-go basis, that suggestion [despite any mis-recollection I might have] is probably sage advice.

Also, from WRKPRB, F11 display APAR library, shows 45 entries.
Only two do NOT have text. The text for the others has this: Problem
1526634180 and system serial number.

Any distant bells ringing?

I do not recall what WRKPRB looks like, and I do not have any access to see it [there is a dearth of panels in the docs, so I will not even waste time looking for any], so I have no idea what the big number is nor why the same number would appear. I also have no idea why any entries would appear devoid of text.

Without some specifics about symptoms [e.g. from the joblog and other spooled results, or the problem entries themselves], a reader is left with nothing to do but ask for details before any worthwhile reply\comment could be composed.