On 03-Nov-2014 13:25 -0600, Steinmetz, Paul wrote:
<<SNIP>>
Below is a snapshot of QADBRSDFRJ, after the failure, before
QSYS/RMVDFRID DFRID(*ALL). It matches the object in question.
File . . . : QADBRSDFRJ Library . : QRECOVERY
<<SNIP>>
*...+....1....+....2....+....3....+....4....+....5....+...
Q1ARSTID BRCAUDIT CRMB42010 3¬}ò¬X^󬤬¬¬¬¬¬¬¬¬¬¬BON¬¬
****** END OF DATA ******
I can find no metadata for that file, but from the msg data in the OP
[delimited with double quote chars]: " }ò X^ó¬¤ "
I can infer that the data displayed above as
[delimited with double quote chars]: "¬}ò¬X^󬤬"
is quite probably what is being utilized as the library name [i.e. the
library of the journal] on the failing STRJRNPF request for the
database *FILE object that is being pre-created by the program QDBRSPRE.
I suppose, then, that the next ten bytes might be used for the Journal
name?
The Display Physical File Member (DSPPFM) results included above do
not include the hexadecimal code points for the non-displayable data; as
best as I can guess, the 10 bytes of data from positions 32 to 41 are:
X'¿¿D0CD¿¿E7B0CE5F9F¿¿' where the ¿ characters represent complete
unknowns, whereas the other bytes are /guessed/ from the glyph presented
for the respective position. Nothing about that data seems to give a
conspicuous hint about its origins, so the actual hex values may not be
of any use either; no harm in including them for future reference,
however. The output from DSPFFD to map the data to column\field names
might be useful generally, given the dearth of info on the web;
otherwise, only those with that file on their system are able to easily
obtain that information.
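For anyone wanting to capture that information, a minimal CL sketch
follows; the commands are standard, with the only assumption being that
the first member of QADBRSDFRJ is the one of interest:

   /* Map the record positions to column\field names; spool for reference */
   DSPFFD     FILE(QRECOVERY/QADBRSDFRJ) OUTPUT(*PRINT)

   /* Print the member data in character and hexadecimal form, so the     */
   /* code points at positions 32 to 41 can be read rather than guessed   */
   CPYF       FROMFILE(QRECOVERY/QADBRSDFRJ) TOFILE(*PRINT) OUTFMT(*HEX)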
Searching on the known hexadecimal digits that compose that invalid
library name might be worthwhile, both within a dump of the library
[created by the restore] on the system and within a dump of the media
LABEL for that library. If found, that might help to infer how the
improper data might have been addressed and then copied or based for
reference as the library name stored in that file [and eventually
referenced again for the failed STRJRNPF request].
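As a sketch of how those two dumps might be produced [the library name
BRCAUDIT is taken from the record shown above, and TAP01 is merely a
placeholder device name]:

   /* Dump the restored library object; the spooled dump can then be      */
   /* scanned for the guessed runs, e.g. X'D0CD' and X'E7B0CE5F9F'        */
   DMPOBJ     OBJ(QSYS/BRCAUDIT) OBJTYPE(*LIB)

   /* Print the media labels so the saved-library label data can be       */
   /* searched as well                                                     */
   DSPTAP     DEV(TAP01) DATA(*LABELS) OUTPUT(*PRINT)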
The libraries are deleted, in a separate job, prior to the
RSTLIBBRM; the libraries never exist.
Any chance the DLTLIB is followed by the improper but seemingly
ubiquitous MONMSG CPF0000, or run in a CL stream that
otherwise\similarly ignores failures [of a severity level deemed
insignificant]? Essentially the question is... might the failing
scenarios occur only when the restore is performed into an existing
library, rather than into a library created as part of the restore?
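To illustrate the pattern being asked about, a minimal sketch of the CL
in such a delete job [BRCAUDIT again stands in for whichever library is
being cycled]:

   /* Pattern in question: every failure of the delete is ignored, so the */
   /* restore could conceivably run against a still-existing library      */
   DLTLIB     LIB(BRCAUDIT)
   MONMSG     MSGID(CPF0000)

   /* Stricter alternative: ignore only "Library not found"               */
   DLTLIB     LIB(BRCAUDIT)
   MONMSG     MSGID(CPF2110)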
The file in error is actually successfully restored; everything
matches the source LPAR.
That is normal for such /integrity/ and /attributes/ losses; i.e. the
condition is merely diagnosed as a prior error, causing the overall
restore to /fail/ with a warning that, although the object may have been
restored successfully, some errors transpired.
Here's the latest update from IBM support.
The DataBase developer asked if you would accept a trap. It would
likely come in the form of an APAR to load onto the system, just as
you would a PTF, and then when the problem recurred, it would dump
additional data. This problem does not seem to be with the SR code,
even though the call to the defer code is; the problem is that the
journal code is flagging that the file is journaled, which is flagging
the SR code to defer the object. This is causing the entire restore
to have a dangling deferred object.
According to development:
"CPF3294 is sent in four places in the code. The normal one is an
exception handler, but that handler enqueues the previous failure
message which is not present in the joblog. So it must be one of the
other three places. This (the trap) will help us to narrow down the
problem by narrowing down which of the 3 remaining places is causing
this problem. Will the customer take a trap? It will not do anything
special or cause any problems; it will simply dump out an extra
message (CPF9898) in the joblog when the CPF3294 is sent."
Given the incidents are rare, applying that as a TESTFIX is probably
a good approach; i.e. the developers have no re-create against which to
perform debug in the lab, and performing a re-create on the customer
system while in debug is not available, per not knowing if\when a
re-create might eventually occur, so that leaves the /trap/ as possibly
the best [effectively the only /debug/] option.
Note: to be clear, a TESTFIX is not a TESTPTF, so PTF apply activity
for maintenance that would supersede that fix will cause an attempt to
apply such maintenance over it; that may or may not be better than the
accidental permanent application of a PTF, which would require a reload
of the OS to revert. Probably not an issue, but with "months", the
possibility for issues with maintenance grows. Thus, arranging
pre-maintenance removal and then post-maintenance re-apply of the
TestFix may be best incorporated into the change management, until such
time that the trap is no longer required and has been both permanently
removed and deleted from the system to prevent re-loading that TestFix
[just like what would be required to prevent accidental perm-apply of a
TestPTF].
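As a rough sketch of the change-management steps being suggested [the
fix identifier SIXXXXX and licensed program 5770SS1 are purely
placeholders, and the exact handling of a TestFix may well differ from
the plain PTF commands shown]:

   /* Before scheduled maintenance: confirm the fix status, then remove   */
   DSPPTF     LICPGM(5770SS1) SELECT(SIXXXXX)
   RMVPTF     LICPGM(5770SS1) SELECT(SIXXXXX) RMV(*TEMP)

   /* After maintenance completes: re-apply until the trap data has been  */
   /* captured and the fix is permanently removed and deleted             */
   APYPTF     LICPGM(5770SS1) SELECT(SIXXXXX) APY(*TEMP)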
The problem is I cannot recreate this; I've tried.
It may take another 3 months before I see the issue.
With persistence and knowledge of the database restore processing, I
do not recall ever being unable to recreate an issue; often the
inability to re-create a restore failure comes down to something simple
that may even have been alluded to in some messaging that occurred in
the original failing scenario but not in the re-create attempt. If a
particular library affected by the problem were able to be repeatedly
restored, I expect an attempt would eventually be found to fail, and
what makes that failure consistent would also be discovered. A
production partition, however, does not allow for creative attempts at
repeated and possibly slightly modified setup\requests to effect a
re-create; and whether the effort to re-create is even worth the payback
is questionable, especially given the object actually restores [and
presumably entirely as-expected\correct; i.e. not journaled] even though
[effectively innocuous] errors occurred.
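For reference, one iteration of such a re-create attempt on a
non-production partition might look like the following [plain RSTLIB is
shown in place of the RSTLIBBRM used in the actual environment, and
TAP01 is again just a placeholder device]:

   /* Delete the library, tolerating only "not found", then restore from  */
   /* the suspect media and spool the restore results for comparison      */
   DLTLIB     LIB(BRCAUDIT)
   MONMSG     MSGID(CPF2110)
   RSTLIB     SAVLIB(BRCAUDIT) DEV(TAP01) OUTPUT(*PRINT)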
FWiW: If the media used in a failing scenario had been sent to IBM to
enable them to attempt to review the restore path for a file that
exhibited the issue, then they might have been able to find or to infer
something about the media or environmental factors that _might_ cause
the code to make the apparently improper decision to start journaling of
that file; i.e. even debug of a non-failing scenario is quite possibly
helpful for inferring how the actual failing scenario might have come
about.
That might have already occurred; I did not go back to past messages
to see whether any included "CPS discussion" snippets or other
information suggested that was done. But it seems at this point that
the path(s) in which the library-name is obtained or addressed are what
require additional scrutiny.