On 24-Sep-2015 08:47 -0600, John McKee wrote:
Yesterday was the fourth week in a row that Go SAVE 21 failed. But,
very differently.
Minimally, the symptoms from the joblog should be offered. The term
"failed" is so nebulous as to ensure that no worthwhile response is
elicited, just inquiries for more information to elucidate the failure.
Following the misadventure where a tape got stuck, I ended up (since
time was already allocated) doing an IPL. That lowered max unprotect
from 20G to around 5G yesterday. Combine that with save/delete of a
number of libraries. No idea if there is any relation.
So some reduced storage required for the *NONSYS portion prior to
this save. That could have prevented the need for the tape-change
[volume-change] condition?
But apparently no other preventive actions were attempted prior to
that save; i.e. no attempt was made to do any of the following?
1> mark the file named flightrec [in whatever directory that resides]
as AlwSav=No
2> disable\remove BRMS
3> disable\remove MSE
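As a rough sketch of item 1, and assuming the installed release
supports the allow-save attribute on the Change Attribute (CHGATR)
command, the stream file could be excluded from the save with something
like the following. The directory shown is only a placeholder, because
the actual location of the flightrec file is not known here:

```cl
/* '/path/to/flightrec' is a placeholder, not the real location */
CHGATR OBJ('/path/to/flightrec') ATR(*ALWSAV) VALUE(*NO)
```

The attribute could be set back with VALUE(*YES) in the same manner
once the save has completed.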
So, yesterday, I saw the SAV portion move along. Then, it froze for
a long time. Initially, I thought perhaps just a really big IFS
file. Maybe, but no idea. The progress line disappeared at some
point and then reappeared. That was followed by messages that
spooled output had been created for... a number of times. I think 21
spool files. Disk usage dropped as a result of IPL and removal of
libraries to the point that I thought it was low enough (from before)
that a second tape would not be needed. There was NO message
requesting a tape change. Just failed backup, and device left FAILED
and damaged.
Without any preventive action taken prior to that save, the same
seize contention could occur for the flight recorder, but occur in a
different code path; the notable wait could have been the effect of the
seize-wait [similar to how a lock-wait is manifest].
While the request to log to the [BRMS] flight recorder was visible
from the tape-change in the prior failures, perhaps this time the [BRMS]
flight recorder logging is still occurring concurrent to the dump\save
of the descriptor with the flightrec file, but from a different code
path. And this time, the failing code path apparently has the First
Failure Data Capture (FFDC) active, for which the effect is:
But, this is different. This time, a problem was logged. The symptom
string is 5722 F/QTADMPDV.
That symptom is generated as a result of FFDC; it is manifest as a
Problem Record visible in Work with Problems (WRKPRB), and is produced
in response to an unexpected failure if the Software Error Logging
(QSFWERRLOG) system value says to do so.
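For reference, both can be checked from a command line with standard
CL commands. This is only a sketch of where to look; nothing in it is
specific to this failure:

```cl
/* Show whether software error logging is active (*LOG or *NOLOG) */
DSPSYSVAL SYSVAL(QSFWERRLOG)

/* List the problem records logged on the system */
WRKPRB
```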
I did attempt to Google this. But, I do not understand what I
received, as it appears to be related to MLB installation.
The generated symptom-string is both sparse and generic, so
inferences from other issues with the same symptom are unlikely to be
telling.
The actual data would need to be reviewed to interpret the meaning
for the failure data being logged. IIRC the QTADMPDV is the program
[TA=Tape, DMP=Dump, DV=Device] of the Tape feature that will dump the
Device to include the Tape Flight Recorder information for that device
[and the corresponding tape MLB]; typically that is something the user
is asked to call, but the FFDC may invoke that feature in response to
whatever failure that feature was logging. The generated symptom may
merely indicate that a non-specific failure is logged to include
dump-tape info, rather than a failure of that dump-tape feature itself.
Just looked at PRTERRLOG. I decided to no longer use the "standard"
of WEEKLY for the volume id. The tape I used had one temporary write
error, with 138457 M Bytes written. This was the first time this
volume id had ever been used.
My guess is that this is probably immaterial with regard to the failure.
To summarize: four fails, only last one logged a problem, last
failure also did NOT issue a tape change request, and there were
entries in LIC LOG much as before. The LIC LOG had no activity posted
prior to the failure, which is also different. The entries, to my
uneducated mind, appear to be same type as before.
My guess is that this failure was the same issue, manifest from a
different place. Only by review of the spooled data and the VLog data
could that be inferred with any confidence. As with the data collected
for the prior failures, I would be willing to review the LIC data and the
aforementioned spooled data produced within the job, to try to infer
what transpired and then to describe that failure in terms more easily
digested.
Question: Is this one of *those* issues that violate that stupid
dogma "If it ain't broke...", and was there some PTF already
created?
It could be that a lack of maintenance allows the failure; that a
preventive PTF may exist, but is not applied. But again, personally, I would
ensure I had a good save of the system [even if not a GO SAVE opt-21;
something from which the system could be reloaded if necessary] before
applying maintenance. As I recall, the directions for applying
maintenance suggest a save before and one after [even if that almost
never happens]; being on a system that has effectively no support except
on a pay-as-you-go basis, that suggestion [despite any mis-recollection
I might have] is probably sage advice in that situation.
Also, from WRKPRB, F11 display APAR library, shows 45 entries.
Only two do NOT have text. The text for the others has this: Problem
1526634180 and system serial number.
Any distant bells ringing?
I do not recall, and I do not have any access to see, what WRKPRB
looks like [there is a dearth of panels in docs, and so I will not even
waste time to look for any], so I have no idea what the big number is
nor why the same number would appear. Also I have no idea why any would
appear devoid of text.
Without some specifics about symptoms [e.g. from the joblog and other
spooled results, or the problem entries themselves], a reader is left
with nothing but wanting to ask for details, before any worthwhile
reply\comment could be composed.