On 03-Nov-2014 13:25 -0600, Steinmetz, Paul wrote:
<<SNIP>>
Below is a snapshot of QADBRSDFRJ, after the failure, before
QSYS/RMVDFRID DFRID(*ALL). It matches the object in question.
File . . . : QADBRSDFRJ Library . : QRECOVERY
<<SNIP>>
*...+....1....+....2....+....3....+....4....+....5....+...
Q1ARSTID BRCAUDIT CRMB42010 3¬}ò¬X^󬤬¬¬¬¬¬¬¬¬¬¬BON¬¬
****** END OF DATA ******
I can find no metadata for that file, but from the msg data in the OP
[delimited with double quote chars]: " }ò X^ó¬¤ "
I can infer that the data displayed above as
[delimited with double quote chars]: "¬}ò¬X^󬤬"
is quite probably what is being utilized as the library name [i.e. the
library of the journal] on the failing STRJRNPF request for the
database *FILE object that is being pre-created by the program QDBRSPRE.
I suppose, then, that the next ten bytes might be used for the Journal
name?
The Display Physical File Member (DSPPFM) results included above do
not include the hexadecimal code points for the non-displayable data; as
best as I can guess, the 10 bytes of data from positions 32 to 41 are:
X'¿¿D0CD¿¿E7B0CE5F9F¿¿' where the ¿ characters represent complete
unknowns, whereas the other bytes are /guessed/ from the glyph presented
for the respective position. Nothing about that data seems to give a
conspicuous hint about its origins, so the actual hex values may not be
of any use either; no harm in including them for future reference,
however. The output from DSPFFD to map the data to column\field names
might be useful generally, given the dearth of info on the web;
otherwise, only those with that file on their system are able to easily
obtain that information.
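For anyone wanting to capture that information, a minimal CL sketch
follows; the commands are standard, with the only assumption being that
the first member of QADBRSDFRJ is the one of interest:

   /* Map the record positions to column\field names; spool for reference */
   DSPFFD     FILE(QRECOVERY/QADBRSDFRJ) OUTPUT(*PRINT)

   /* Print the member data in character and hexadecimal form, so the     */
   /* code points at positions 32 to 41 can be read rather than guessed   */
   CPYF       FROMFILE(QRECOVERY/QADBRSDFRJ) TOFILE(*PRINT) OUTFMT(*HEX)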
Searching on the known hexadecimal digits that compose that invalid
library name might be worthwhile, both within a dump of the library
[created by the restore] on the system and within a dump of the media
LABEL for that library. If found, that might help to infer how the
improper data might have been addressed and then copied or based for
reference as the library name stored in that file [and eventually
referenced again for the failed STRJRNPF request].
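As a sketch of how those two dumps might be produced [the library name
BRCAUDIT is taken from the record shown above, and TAP01 is merely a
placeholder device name]:

   /* Dump the restored library object; the spooled dump can then be      */
   /* scanned for the guessed runs, e.g. X'D0CD' and X'E7B0CE5F9F'        */
   DMPOBJ     OBJ(QSYS/BRCAUDIT) OBJTYPE(*LIB)

   /* Print the media labels so the saved-library label data can be       */
   /* searched as well                                                     */
   DSPTAP     DEV(TAP01) DATA(*LABELS) OUTPUT(*PRINT)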
The libraries are deleted, in a separate job, prior to the
RSTLIBBRM; the libraries never exist.
Any chance the DLTLIB is followed by the improper but seemingly
ubiquitous MONMSG CPF0000, or run in a CL stream that
otherwise\similarly ignores failures [of a severity level deemed
insignificant]? Essentially the question is... might the failing
scenarios occur only when the restore is performed into an existing
library, rather than into a library created as part of the restore?
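To illustrate the pattern being asked about, a minimal sketch of the CL
in such a delete job [BRCAUDIT again stands in for whichever library is
being cycled]:

   /* Pattern in question: every failure of the delete is ignored, so the */
   /* restore could conceivably run against a still-existing library      */
   DLTLIB     LIB(BRCAUDIT)
   MONMSG     MSGID(CPF0000)

   /* Stricter alternative: ignore only "Library not found"               */
   DLTLIB     LIB(BRCAUDIT)
   MONMSG     MSGID(CPF2110)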
The file in error is actually successfully restored; everything
matches the source LPAR.
That is normal for such /integrity/ and /attributes/ losses; i.e. the
condition is merely diagnosed as a prior error, causing the overall
restore to /fail/ with a warning that, although the object may have been
restored successfully, some errors transpired.
Here's the latest update from IBM support.
The DataBase developer asked if you would accept a trap. It would
likely come in the form of an APAR to load onto the system, just as
you would a PTF, and then when the problem recurred, it would dump
additional data. This problem does not seem to be with the SR code,
even though the call to the defer code is; the problem is that the
journal code is flagging that the file is journaled, which is flagging
the SR code to defer the object. This is causing the entire restore
to have a dangling deferred object.
According to development:
"CPF3294 is sent in four places in the code. The normal one is an
exception handler, but that handler enqueues the previous failure
message which is not present in the joblog. So it must be one of the
other three places. This (the trap) will help us to narrow down the
problem by narrowing down which of the 3 remaining places is causing
this problem. Will the customer take a trap? It will not do anything
special or cause any problems; it will simply dump out an extra
message (CPF9898) in the joblog when the CPF3294 is sent."
Given the incidents are rare, applying that as a TESTFIX is probably
a good approach; i.e. the developers have no re-create against which to
perform debug in the lab, and performing a re-create on the customer
system while in debug is not available, per not knowing if\when a
re-create might eventually occur, so that leaves the /trap/ as possibly
the best [effectively the only /debug/] option.
Note: to be clear, a TESTFIX is not a TESTPTF, so PTF apply activity
for maintenance that would supersede that fix will cause an attempt to
apply such maintenance over it; that may or may not be better than the
accidental permanent application of a PTF, which would require a reload
of the OS to revert. Probably not an issue, but with "months", the
possibility for issues with maintenance grows. Thus, arranging
pre-maintenance removal and then post-maintenance re-apply of the
TestFix may be best incorporated into the change management, until such
time that the trap is no longer required and has been both permanently
removed and deleted from the system to prevent re-loading that TestFix
[just like what would be required to prevent accidental perm-apply of a
TestPTF].
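As a rough sketch of the change-management steps being suggested [the
fix identifier SIXXXXX and licensed program 5770SS1 are purely
placeholders, and the exact handling of a TestFix may well differ from
the plain PTF commands shown]:

   /* Before scheduled maintenance: confirm the fix status, then remove   */
   DSPPTF     LICPGM(5770SS1) SELECT(SIXXXXX)
   RMVPTF     LICPGM(5770SS1) SELECT(SIXXXXX) RMV(*TEMP)

   /* After maintenance completes: re-apply until the trap data has been  */
   /* captured and the fix is permanently removed and deleted             */
   APYPTF     LICPGM(5770SS1) SELECT(SIXXXXX) APY(*TEMP)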
The problem is I cannot recreate this; I've tried.
It may take another 3 months before I see the issue.
With persistence and knowledge of the database restore processing, I
do not recall ever being unable to recreate an issue; often the
inability to re-create a restore failure comes down to something simple
that may even have been alluded to in some messaging that occurred in
the original failing scenario but not in the re-create attempt. If a
particular library affected by the problem were able to be repeatedly
restored, I expect an attempt would eventually be found to fail, and
what makes that failure consistent would also be discovered. A
production partition, however, does not allow for creative attempts at
repeated and possibly slightly modified setup\requests to effect a
re-create; and whether the effort to re-create is even worth the payback
is questionable, especially given the object actually restores [and
presumably entirely as-expected\correct; i.e. not journaled] even though
[effectively innocuous] errors occurred.
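For reference, one iteration of such a re-create attempt on a
non-production partition might look like the following [plain RSTLIB is
shown in place of the RSTLIBBRM used in the actual environment, and
TAP01 is again just a placeholder device]:

   /* Delete the library, tolerating only "not found", then restore from  */
   /* the suspect media and spool the restore results for comparison      */
   DLTLIB     LIB(BRCAUDIT)
   MONMSG     MSGID(CPF2110)
   RSTLIB     SAVLIB(BRCAUDIT) DEV(TAP01) OUTPUT(*PRINT)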
FWiW: If the media used in a failing scenario had been sent to IBM to
enable them to attempt to review the restore path for a file that
exhibited the issue, then they might have been able to find or to infer
something about the media or environmental factors that _might_ cause
the code to make the apparently improper decision to start journaling of
that file; i.e. even debug of a non-failing scenario is quite possibly
helpful for inferring how the actual failing scenario might have come
about.
That might have already occurred; I did not go back to past messages
to see whether any included "CPS discussion" snippets or other
information suggested that was done. But it seems at this point that
the path(s) in which the library-name is obtained or addressed are what
require additional scrutiny.