On 11 Apr 2013 09:02, dale janus wrote:
You probably explained it. I may have orphaned the lock if I tried
to cancel the CHGPF job.
If indeed a lock was orphaned, one that is not visible via WRKOBJLCK,
then it was almost surely due to a defect in the OS code.
I seem to recall that it was deemed acceptable for the non-commit DB
recovery code path to leave its normal object locks held, pending a
recovery being initiated for the interrupted request; only an invocation
termination that also ended the job would implicitly drop those locks.
IIRC that effect was due to there being no /invocation exit/
established for that code path. Without such a /cancel handler/, only a
request that either ran successfully to completion or failed due to
handled exceptions would back out those locks. Thus ENDRQS, or an
unmonitored exception that terminated the request, could acceptably
leave behind any locks that had been obtained. However any non-standard
types of locks, i.e. locks other than the object and data locks, such as
SLLs, were supposed to be /protected/ from EndRqs, specifically to
ensure that they could not be orphaned by a user request to end the
invocation.
I really don't remember if I let it time out or not.
But if the CHGPF request had instead failed by timing out while
trying to obtain all of the necessary locks, then, as a /normal/ and
monitored failure, the code should have dropped any locks the processing
had obtained before backing out its attempt at forward progress.
Another job should not encounter any conflicting locks for its requests
against the file if the job requesting the CHGPF had failed solely due
to its inability to allocate the file; i.e. failed with CPF3202 or
CPF3203 as the error, per a timeout on the CHGPF request obtaining the
necessary locks to proceed with its work.
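For example, a caller can monitor explicitly for those allocation
failures; a minimal CL sketch follows, assuming the allocation-timeout
scenario described above [the file, library, and message text are
placeholders, not taken from the actual environment]:

```
/* Attempt the change; if the file cannot be allocated within   */
/* the WAITFILE time, CHGPF fails with CPF3202 or CPF3203, and  */
/* any locks obtained so far should already have been released. */
CHGPF      FILE(MYLIB/MYFILE) SRCFILE(MYLIB/QDDSSRC)
MONMSG     MSGID(CPF3202 CPF3203) EXEC(DO)
  SNDPGMMSG  MSG('CHGPF could not allocate the file; retry later.')
ENDDO
```

In that monitored-failure path, per the design intent described above,
no conflicting locks should remain visible to other jobs.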
Probably if I had signed off from the session that ran the CHGPF,
the lock would have been released.
Yes. Although if the situation could be recreated on a test file,
such that the conflicting lock is not visible via WRKOBJLCK, then that
is probably a defect that can be reported.
Or if I would have changed the heading using SQL or the database part
of ops navigator, but green screen commands die hard.
I think only performing the change operation under commitment control
would have changed the outcome; the SQL, requested with no isolation,
would effectively perform the same request as the CHGPF SRCFILE(named)
when that source changes only column headings. That is presumed solely
because of the different implementations for how locks are registered
and removed in the commit vs non-commit code paths for database
recovery. That leaves only LABEL ON to effect the request [under
commitment control], because an SQL ALTER request does not offer the
option to change the column labels the way the request to CHGPF
SRCFILE(specified) does.
When a termination occurs and the work has been registered under
commitment control, the locks that had been obtained are dropped as
part of the explicit or implicit rollback [ROLLBACK] of an interrupted
request, or when the successfully completed operation is eventually
committed [COMMIT].
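Concretely, the change under commitment control might look like the
following SQL; this is a sketch only, with placeholder library, file,
and column names, assuming an isolation level other than *NONE is in
effect [e.g. COMMIT(*CHG)]:

```
-- Change the column heading as a committable unit of work; if the
-- job is interrupted before the COMMIT, the implicit ROLLBACK
-- should release the locks rather than orphan them.
LABEL ON COLUMN MYLIB.MYFILE.MYCOL IS 'New Column Heading';
COMMIT;
```

That is, the commit code path registers the locks with the commitment
definition, so an interrupted request is backed out rather than left
pending recovery.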
We are running V7R1 and applied latest cum a few weeks ago.
The condition may be easy to recreate in a test environment, using
jobs that need not involve the web interface but merely mimic what the
web interface did. Such a recreate scenario could be submitted as a
defect report to the service provider, in expectation of an APAR and
PTF from IBM. Getting a PTF as a preventive sure beats encountering
the problem again, and could save others the same hassle.
I am still concerned that WRKOBJLCK did not show the problem,
As would I be... and likely indicative of a defect with the OS.
If the origin was a conflict with a held/orphaned SLL, the nature of
SLLs, as I recall, would not allow presentation via WRKOBJLCK very
easily, nor especially at all efficiently. A Space Location Lock is on
any space, and is not specific to an /object/ as an allocated resource.
And as I recall they are most easily obtained from the job, which is why
they are available via the Retrieve Job Locks (QWCRJBLK) API [and
similar], based on MATPRLK, but not via the List Object Locks (QWCLOBJL)
API, based on MATOBJLK. As noted in my earlier reply, I believe iNav has
an interface to show SLLs that are held, perhaps also showing waiters,
though most likely in an interface requesting information about one or
more /jobs/ vs one requesting information about an /object/; i.e. a
/job/ interface vs an /object/ interface. Anyhow...
An option to materialize a list of SLLs using the base address of any
space object type as input would be nice. As it is, the specific
address with offset [the specific location] must be given to inquire
for a list of any active holders. Otherwise all processes would have to
be materialized for all of their held SLLs, and the list of addresses
then pared down to those that share the base addresses of interest.
If that were available via the LIC, then the database could inquire on
all of the base addresses of the various space objects that make up the
composite object of the database *FILE [for a request from the Work
Control feature (WC) via WRKOBJLCK], to present the effects on an
object basis.
but I can understand it now due to the odd nature of my problem.
Odd, as in, likely a defect. Not as in /understand/ that something
was done wrong; just that what was done, if EndRqs, might validly leave
locks, but would not /validly/ leave locks that are not visible from
WRKOBJLCK [based on my recollection of the design intent for the OS
database feature (DB)].
Were there any errors preceding the -913 in the jobs getting the
SQL0913? Any such messages could assist in finding the origin; e.g.
MCH5804 "Lock space location operation not satisfied ..." vs MCH5802
"Lock operation for object &1 not satisfied" clearly diagnoses what
type of lock was the origin of the conflict. The failing instructions
identify exactly the code that requested the lock, and the code path in
which that lock request appears could make the reason the lock was
requested very conspicuous; e.g. a preceding test in the OS code that
says "if the mutex-like indicator is set, then request a read-SLL to
ensure not to proceed until the SLL can be obtained" could be very
revealing as to origin.
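To look for those preceding messages, the job log of an affected job
can be spooled and searched; a minimal sketch, with a placeholder job
name [the qualified job would be whichever job received the SQL0913]:

```
/* Spool the job log of an affected job, then scan the spooled  */
/* output for MCH5802 or MCH5804 entries preceding the SQL0913. */
DSPJOBLOG  JOB(123456/WEBUSER/QZDASOINIT) OUTPUT(*PRINT)
```

The second-level text of whichever MCHxxxx appears would then point at
the space location, or the object, that was the origin of the conflict.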
While I had suggested in my earlier reply that OPEN is unaffected by
pending recovery, I seem to recall that an open by the SQL might have a
protocol for delaying an open pending completion of certain
identified-as-/exclusive/ work, for which a member or data lock might
not be held to prevent the open, but for which the SQL should probably
await completion. And I suppose that exclusivity might have been
implemented via a flag in the file [as a space, it can be changed
irrespective of locking], which acts as an effective mutex informing
the SQL that it must await completion of some changes; and perhaps that
was implemented via Space Location Locks [SLL], i.e. that location
would have been locked by the CHGPF requester, an SLL obtained, and
then the SQL open would await a lock on that location if the
exclusive-work flag was set. I seem to recall that some easy action
would reset the flag in situations where the flag was improperly left
on... perhaps something like DSPFD?
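The hypothesized flag-plus-SLL protocol can be caricatured in ordinary
code. The sketch below is purely illustrative, assuming the mechanism
speculated above; it models no actual IBM i interface, and every name
in it is invented. A writer holds a lock on a known location and sets
an "exclusive work" flag; an opener that sees the flag set must wait on
that same location before proceeding:

```python
import threading

class FileHeader:
    """Toy stand-in for the in-space flag plus a lockable location."""
    def __init__(self):
        self.exclusive_work = False            # the mutex-like indicator
        self.location_lock = threading.Lock()  # stand-in for the SLL

def begin_exclusive_change(hdr):
    # A CHGPF-like writer locks the location, then sets the flag.
    hdr.location_lock.acquire()
    hdr.exclusive_work = True

def end_exclusive_change(hdr):
    # Normal completion clears the flag and releases the location.
    hdr.exclusive_work = False
    hdr.location_lock.release()

def open_file(hdr):
    # An opener that sees the flag set waits on the location before
    # proceeding; if the flag is clear, the open proceeds immediately.
    if hdr.exclusive_work:
        with hdr.location_lock:  # blocks until the writer releases it
            pass
    return "opened"
```

Note what happens if the writer is interrupted and end_exclusive_change
never runs: every later opener blocks on the held location, which is
analogous to the orphaned-lock symptom, and nothing at the /object/
level shows a conflict.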
FWiW here is a v6 issue describing the /change file/ interface as an
example of the OS DB leaving an orphan lock. That example involved
referential integrity, where the orphaned lock was left on the parent
file vs the child file with the dependent data; no mention of the type
of lock that was orphaned:
http://www.ibm.com/support/docview.wss?uid=nas3bd0dcd2b5f3164e28625772a0073bbb3
For reference only:
http://www.google.com/search?q=%22space+location+lock%22+sql0913+OR+%22-913%22+OR+%22msgsql0913