Re: MCH5802 - Lock operation for object TRACE TRACE not satisfied. -- MIDRANGE-L

On 23-Mar-2016 08:36 -0500, Steinmetz, Paul wrote:

What would cause a MCH5802
"Lock operation for object TRACE TRACE not satisfied."

Specifically for the symptom msgMCH5802 F/RmslLockMiObject T/QDBOPEN TP/QDBOPEN STMT/19483 [obtained\constructed from joblog info included later], that would be a failing MI LOCK instruction [per timeout] for either the member, data[space], or access path with that name; i.e. essentially, for the member TRACE of database file TRACE. The OS msgCPF4126 RC1 F/QDBSIGEX clarifies that the object unable to be allocated was the keyed Access Path of the data for that member.

Notably, from the help text for the Access Path Recovery (RECOVER) keyword for the Create Physical File (CRTPF), "While the access path is being rebuilt, the file cannot be used by any job."; more appropriately, the keyed access *open* of the *member* of the file can not be performed by any *other* [thread or] job than the job performing the rebuild. From the details in the remainder of the topic, in which I further comment inline, I infer that is why the open did not transpire in a second\the-failing job.

I had two occurrences last night.
Both were following RSTLIBBRM(s) of test environment.
One job 811697 opened the same file 63 seconds earlier with no
issue.
According to the message details, the first job, 811697, which was
successful, was the lock holder.

Checking the joblog for the first job, 811697, I see it took 4
minutes, 32 seconds for the program to open its 4 files.
Normally this is 1 second.
Date sent . . . : 03/23/16 Time sent . . . : 02:29:56
Date sent . . . : 03/23/16 Time sent . . . : 02:34:28

I'm thinking it could be related to an access path being rebuilt
following the RSTLIBBRM.
What is puzzling is I don't see a CPF3145 Access path built for
member for the file in question.
I do see many other CPF3145.

The other msgs CPF3145 were seen where? In the joblog of the restore, in QSYSOPR, or in QHST via DSPLOG?

Presumably, the job 811697 that opened the member first, effected the AccPth rebuild activity. I expect possibly that the msg CPF3145 T/QSYSOPR is not an expected effect; perhaps that the message is sent, but only T/QHST. Either way, I am confident the message is not sent to the joblog of the opener. The CPF3145 messages seen, I suspect likely were sent from a QDBSRV## runpty-52 job, and were sent To *SYSOPR, for which an implicit copy also would have been sent To QHST. And depending on the outcome for the opener, the msg CPF3146 may have been the effect instead; albeit unlikely, as the program as opener; i.e. the first job likely would have failed also, but with a different open error.?

Anyhow, the presumed-successful opener of the member [with the keyed access open method] apparently won the race to effect the rebuild; in a race first against the Database Server jobs, to which the rebuild-work I expect would have been added asynchronously via an event as a side effect of the restore\load-data recognizing the need to rebuild, and then eventually that rebuild expected to be started as work pulled from a queue [that is the expected effect for a known-to-be required rebuild, for which Edit Rebuild Of Access Paths (EDTRBDAP) is the UI for showing\managing the enqueued work] and a race against the failing job that was also attempting to open the member.

The TRACE file in question only has 1 access path.
The file is quite large.

Member List
… Creation Last Change Deleted
Member Size … Date Date Time Records Records
TRACE 7329742848 … 02/05/05 03/23/16 08:33:24 5246938 18248817

Hmm. A physical file [member] rather than a logical file. The Access Paths of a physical file are saved [unless invalid] and restored irrespective the Save Access Paths (ACCPTH) specification on the save, so that the access path is apparently being rebuilt seems problematic; seems the saved file must have an invalid access path. As an LF Member (LFM), the Change Logical File (CHGLF) has a Force Rebuild Of Access Path (FRCRBDAP) parameter that could assist, but for a PF Member the options are to ALTER [or Change Physical File (CHGPF) with Source File (SRCFILE) other than *NONE] or Reorganize Physical File Member (RGZPFM) using Allow Cancel (ALWCANCEL) of *NO and Rebuild Access Paths (RBDACCPTH) of *YES [i.e. offline reorg].

If there is no CPF3145, perhaps the oddity\rarity of a physical access path is something that went unnoticed, as not transpiring; is perhaps, only functional for logical access paths.?

Hmm. 3X as many deleted, as there are active records. Having been created in 2005, if the file is established for Reuse Of Deleted Records per REUSEDLT(*YES), then the Deleted Entry BitMap may be broken [an old defect whereby recovery is either toggle the REUSEDLT attribute setting or issue an offline Reorganize Physical File Member (RGZPFM)] with that many deleted records; though that should have no impact on the AccPth, given deleted records are not tracked.

Subsystem class - Default wait time in seconds: 30

Access path maintenance . . .: MAINT *IMMED
Access path recovery . . . .: RECOVER *NO
Maximum file wait time . . .: WAITFILE *IMMED
Maximum record wait time . .: WAITRCD *NOMAX

Missing but possibly pertinent values are:
• Access Path Valid [also, for the file that was saved; see first of two aforementioned "Hmm."]
• Access Path Size (ACCPTHSIZ) (*MAX1TB might have been required to enable a valid access path being maintained against the larger [pre-delete] amount of data [to accommodate all key values] for a given Maximum Key Length)

I have a few unanswered questions.
1) What changed, we've never seen this MCH5802 error following a
RSTLIBBRM?

• The conflicting job is new or faster\different-timing than in the past, many more invalid\unsaved access paths are on the media delaying implicit\async rebuild, or there was not previously two concurrent jobs started both doing the keyed open [so soon] after the restore.
• The saved file no longer has a valid access path and thus must be rebuilt upon restore? The file being saved would need to have that remedied so as to allow the save to include the access path; see RGZPFM mentioned previously.
• Note(a) as FYI only: If the issue had been with an LF vs a PF, then given the restore is not a scratch restore of the member data [such that instead, is overlay\restore-over existing data] then perhaps a mismatch was noted between the level identifiers of the saved AccPth and the existing AccPth [for which a scratch restore should be performed to prevent the condition]; or, irrespective scratch\restore-over, if the LFM was not restored in the same request as the PFM.
• Note(b) as FYI only: If the issue had been with an LF vs a PF, then the save may no longer be saving access paths by choice or by default; i.e. the Save Access Paths (ACCPTH) parameter specification? All PF should have all valid access paths saved, both those of each PFM and each LFM [for why a LFM must be saved with the PFM(s) in the same save request and why a LFM must be restored with the PFM(s) in the same restore request to enable the logical keyed access path to restore rather than having to be rebuilt]

2) Why was the first job successful and 2nd job failed?

A side effect of a race to open with a required AccPth rebuild is that the loser may experience the timeout in attempting to allocate the object(s) required to effect the rebuild or effect the open, and thus fail to open the member. The length of time to wait for the LOCK instruction could be set to ~60 rather than the Default Wait (DFTWAIT), and despite being an Open request, the specified length of time for timeout may be either in-addition-to or otherwise distinct-from the Maximum File Wait Time (WAITFILE).

3) If TRACE01 access path was rebuilt, why did the system not log a
CPF3145?

Presumably that is file TRACE; as in the file named in the error.? That is, presumably the correct name was being researched when looking for a corresponding CPF3145.?

That may be normal for a user-initiated open; i.e. for an access path rebuild as the effect of Open Database File (OPNDBF) or via the equivalent Keyed Database Member Open method by another means, the CPF3145 [is not issued to the joblog and] may only be issued to the history log [thus not visible in QSYSOPR, if that is where the messages were being viewed]. Be sure to review the Display Log (DSPLOG) for QHST to locate any CPF31## messages, instead of using only the Display Message (DSPMSG) for *SYSOPR.

I read many threads related to MCH5802, discuss increasing the 30
second class default wait time.

The msg MCH5802 is generic; any condition is specific to the code that is requesting the lock, and what processing that code is doing. The Class (CLS) object may not be germane for any particular instance.

The Default Wait Time (DFTWAIT) is used only on LOCK instructions that do not specify an explicit TimeOut value. When the Open processing attempts to perform the open, the determination that the AccPth needs rebuilding likely results in an attempt to obtain an exclusive lock before preceding; that time is likely not considered for open-timeout and thus likely given a hard-coded timeout irrespective the DftWait timeout. If that lock can be obtained, the rebuild can be attempted before the real work for Open proceeds. Had there been no AccPth rebuild requirement, the effective Create Duplicate Object (CRTDUPOBJ) of the member object used to create the temporary Open Data Path (ODP) probably would have been performed, and that is the phase of processing for which the WaitFile timeout value has real meaning.

MCH5802 Escape 20 03/23/16 02:30:59.863601
< ckMiObject 001368 QDBOPEN QSYS *STMT
From Program : RmslLockMiObject
To module . .: QDBOPEN
To procedure : QDBOPEN
Statement . .: 19483
Message: Lock operation for object TRACE TRACE not satisfied.
Cause . . : The lock operation was not satisfied for object
TRACE TRACE in the specified time interval of 63 seconds.
The lock holder type is 1.
The lock holder name is CURBSTONE MISSLICE 811697.
The lock holder thread identifier is X'000000000000007F'. The
lock holder type has the following meaning: [...]
1 - The lock holder is a job. [...]
CPF4126 Escape 50 03/23/16 02:30:59.864023
QDBSIGEX QSYS 05A4 AICURB IC08XXCHGP *STMT
To module . .: AICURB
To procedure : AICURB
Statement . .: 1000001
Message: Object not available for file TRACE.
Cause . . : Cannot allocate object for file TRACE in library
MSPENWORK because object not available. The object is specified
in reason code 1 as follows: 1 -- Access path for Member TRACE
<<SNIP>>
RNX1216 Escape 99 03/23/16 02:31:00.712231
QRNXIE QSYS *STMT AICURB IC08XXCHGP *STMT
From module .: QRNXMSG
From procedure: SignalException
Statement . .: 26
To module . .: AICURB
To procedure : AICURB
Statement . .: 1000001
Message: Error message CPF4126 appeared during OPEN for file TRACE.
Cause . . : RPG procedure AICURB in program IC08XXCHGP/AICURB
received the message CPF4126 while performing an implicit OPEN
operation on file TRACE. The actual file is TRACE. <<SNIP>>
RNQ1216 Sender copy 99 03/23/16 02:31:01.045936
QRNXIE QSYS *STMT QRNXIE QSYS *STMT
From module .: QRNXMSG
From procedure: InqMsg
Statement . .: 15
To module . .: QRNXMSG
To procedure : InqMsg
Statement . .: 15
Message: Error message CPF4126 appeared during OPEN for
file TRACE (C S D F).
Cause . . : RPG procedure AICURB in program IC08XXCHGP/AICURB
received

If the timeout [seen in the failure] is additive to the WAITFILE for the complete Open processing, then increasing that value [shown as *IMMED, meaning no-wait-timeout] on the saved file [or on restored-to file, if using restore-over vs scratch-restore of the file] could assist to prevent the timeout.? I expect more likely, the condition(s) for which the access path must be rebuilt are best resolved; e.g. if not saving access paths, then changing to do so [along with any other required corrections such as ensuring that the AccPth of the saved file is valid] would be the ideal resolution. Ensuring the AccPths are saved\restored prevents need for any rebuilds and thus should avoid any potential conflicts for attempts to rebuild; excepting conditions whereby the access path is marked invalid after restore due either to defect or defect-corrective-action, but that seems highly improbable as possible origin in the described scenario. But as noted earlier, being a PF, an issue for a PFM AccPth being rebuilt is apparently due to the AccPth being invalid on the saved-from member [and thus the AccPth was omitted from the media]; ensuring the PFM AccPth gets saved\restored, requires correcting the condition of the invalid access path whence the file\member is being saved; if the AccPth of the saved file claims to be valid, then something else\weirder is going on [like alluded "conditions", just above].