Re: Unusual DB error -- MIDRANGE-L

On 18-Dec-2014 16:06 -0600, Thomas Garvey wrote:

We have a program that writes records into an externally defined
database file that has received the following error...

<<SNIPped joblog: rewritten for compactness, symptoms,
and adding some explanation>>

MCH3203 DbReuseFind 000FB4 QDBPUT 06B1
12/18/14 08:52:28.640853
Message . . . . : Function error X'1720' in machine instruction.
Internal dump identifier (ID) 0300167E.

msgMCH3203 F/DbReuseFind x/000FB4 T/QDBPUT x/06B1
rcx1720 rc1720 vl02001720

The LIC DB code that attempted to find a location among the [set of] deleted record(s) failed; a VLIC Log (VLog) was generated [major code 0200 for "function check" and minor code 1720 for an effective "failed assumption"] to record the failure. The failed attempt was a "put", aka a write or insert; apparently the dataspace supports Reuse Deleted Records (REUSEDLT), and thus the LIC DB was looking for a place to insert the record(s) into a slot held by a since-deleted row. Something about this search went awry; the error is likely an assertion that something should have been true but was found not to be [i.e. a failed assumption]. So rather than progressing in spite of the bad information [which could have led to GIGO], the operation was canceled to protect the integrity of the data.
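To make the major\minor split concrete, here is a minimal Python sketch that breaks a symptom string such as vl02001720 into its two halves. The helper name split_vlog_code is hypothetical, and the only assumption is the layout described above: an optional "vl" prefix followed by four hex digits of major code and four of minor code.

```python
from typing import Tuple

HEX_DIGITS = set("0123456789abcdefABCDEF")

def split_vlog_code(code: str) -> Tuple[str, str]:
    """Split a symptom string like 'vl02001720' into (major, minor).

    Hypothetical helper: assumes an optional 'vl' prefix followed by
    exactly eight hex digits, the first four being the major code and
    the last four the minor code.
    """
    digits = code.strip()
    if digits.lower().startswith("vl"):
        digits = digits[2:]
    if len(digits) != 8 or any(c not in HEX_DIGITS for c in digits):
        raise ValueError(f"not an 8-hex-digit VLog code: {code!r}")
    return digits[:4].upper(), digits[4:].upper()

major, minor = split_vlog_code("vl02001720")
print(f"major={major} minor={minor}")
```

For the code quoted above, that yields major 0200 and minor 1720.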

CPF9999 QMHUNMSG *N QDBPUT 06B1
Message . . . . : Function check. MCH3203 unmonitored by QDBPUT
at statement *N, instruction X'06B1'.

msgCPF9999 F/QMHUNMSG t/QDBPUT x/06B1

The LIC DB module that implements the /insert-row/ method was invoked by the OS DB program QDBPUT; at that fix-level of QDBPUT, the invocation occurs at the failing instruction x/06B1 [for whatever specific type of write activity, according to the code path].
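The compact symptom strings used in this reply [msgXXXnnnn F/... x/... T/... x/...] are just my shorthand, not an official format. A minimal Python sketch of how one might pull them apart follows; it assumes F/ names the sending program or module, T/ names the receiving program, and each x/ token is the hex instruction offset for whichever side preceded it. The names Symptom and parse_symptom are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Symptom:
    msgid: str
    from_name: Optional[str] = None
    from_offset: Optional[int] = None
    to_name: Optional[str] = None
    to_offset: Optional[int] = None

def parse_symptom(text: str) -> Symptom:
    """Parse shorthand like 'msgMCH3203 F/DbReuseFind x/000FB4 T/QDBPUT x/06B1'.

    Assumed layout (inferred from the joblog lines, not a documented format):
    'msg' + message id, then F/<from> and T/<to>, each optionally followed
    by x/<hex offset>. Prefix case and the slash after 'x' are tolerated
    as optional because the shorthand is written inconsistently.
    """
    tokens = text.split()
    if not tokens or not tokens[0].lower().startswith("msg"):
        raise ValueError(f"missing msg token: {text!r}")
    result = Symptom(msgid=tokens[0][3:].upper())
    side = None  # which side the next offset token belongs to
    for tok in tokens[1:]:
        low = tok.lower()
        if low.startswith("f/"):
            result.from_name, side = tok[2:], "from"
        elif low.startswith("t/"):
            result.to_name, side = tok[2:], "to"
        else:
            match = re.fullmatch(r"x/?([0-9a-f]+)", low)
            if match is None or side is None:
                raise ValueError(f"unexpected token: {tok!r}")
            offset = int(match.group(1), 16)
            if side == "from":
                result.from_offset = offset
            else:
                result.to_offset = offset
    return result

print(parse_symptom("msgCPF9999 F/QMHUNMSG t/QDBPUT x/06B1"))
```

A to-side with no offset, as in t/usrpgm, simply leaves to_offset unset.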

The generic "function check" condition, aka the *FC, indicates that the OS Database "Put" program did not monitor for the failure... which is correct; there is really nothing the DB2 insert feature can do when the underlying database support fails, so the failure is manifested as an "unmonitored failure", which also enables some internal diagnostic messaging support to better record the failure.

CPF3698 QSCPUTR 005E QMHAPD 0500
Message . . . . : Dump output directed to spooled file 1,
job 875639/JOBNAME/E206151053 created on system ITEAM on...

msgCPF3698 F/QSCPUTR x/005E T/QMHAPD x/0500

As part of Software Logging [SC: ¿Service Component?] for software errors, the prior /unmonitored/ condition is further logged as a "software problem". This is not really germane; the OS was told to log the problem, that was done, and the outcome was recorded in the joblog.

CPF5257 QDBSIGEX 0539 E20615105R WORKLIB STMT/2123
Message . . . . : Failure for device or member E206151053 file
E206151053 in library WORKLIB.
Cause . . : An error occurred during a read or write operation.

msgCPF5257 F/QDBSIGEX x/0539 t/usrpgm

The user program that issued the failing statement [a WRITE, INSERT, or whatever] is E20615105R in WORKLIB, at STMT/2123.

The requesting program must be informed of the failed I/O. The I/O error is signaled by the generic "Signal Exception" routine of the OS Database component, using the Common Data Management [DM] range of messages, because the I/O was via the Open Data Path (ODP) created for the Database Open request. The member, file, and library are offered as part of the information; that the operation was a "write" rather than a read can be inferred from the OS DB program name.

I examined the file and could not determine if it was damaged. Then
I noticed the 'DbReuseFind' reference (in the original MCH3203
message) which made me check the maximum file size, and number of
deleted records, as I know the file is configured to 'reuse deleted
records'. The file had some 2,500,000 records, and slightly more than
1,000,000 were deleted records, and the file has *NOMAX for Member
Size. So, I reorganized the file, just to see what would happen. It
reorganized just fine, eliminated the deleted records, and I
restarted the job.

The Reorganize Database File Member (RGZPFM) request will purge the deleted records, and thus purge the reuse-table; i.e. there will be no table in which deleted rows are tracked, and thus the database insert\put will have no reason to look for a deleted row to replace with an active row.

Thus, the problem was "circumvented"; the origin of the problem has not [yet] been investigated, and the details of the condition exist only in the logging, because the object [the dataspace] against which the failure transpired no longer exhibits the problem.
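Purely as an illustration of the idea, and emphatically not a description of how the LIC actually implements REUSEDLT, here is a toy Python model. A put consults a table of reusable slots, refuses to proceed when that table disagrees with the data [the "failed assumption"], and a reorganize discards both the deleted rows and the table. Every name here [ToyDataspace, ReuseAssertionError, reuse_table] is hypothetical.

```python
class ReuseAssertionError(Exception):
    """Raised when the reuse bookkeeping disagrees with the row data."""

class ToyDataspace:
    """Toy stand-in for a dataspace with reuse of deleted rows.

    This is an assumption-laden sketch for illustration only; the real
    LIC data structures are not described in this thread.
    """

    def __init__(self):
        self.rows = []         # row payloads; None marks a deleted slot
        self.reuse_table = []  # slot numbers believed to hold deleted rows

    def put(self, value):
        if self.reuse_table:
            slot = self.reuse_table[-1]
            # The "assumption": a tracked slot must exist and be deleted.
            if slot >= len(self.rows) or self.rows[slot] is not None:
                # Cancel rather than overwrite possibly-live data.
                raise ReuseAssertionError(f"slot {slot} is not a deleted row")
            self.reuse_table.pop()
            self.rows[slot] = value
            return slot
        self.rows.append(value)
        return len(self.rows) - 1

    def delete(self, slot):
        if self.rows[slot] is None:
            raise ReuseAssertionError(f"slot {slot} is already deleted")
        self.rows[slot] = None
        self.reuse_table.append(slot)

    def reorganize(self):
        # Drop deleted rows; nothing is left to track, so the table empties.
        self.rows = [row for row in self.rows if row is not None]
        self.reuse_table = []

ds = ToyDataspace()
for value in ("a", "b", "c"):
    ds.put(value)
ds.delete(1)
ds.reuse_table.append(0)  # simulate bad bookkeeping: slot 0 is still active
try:
    ds.put("d")
except ReuseAssertionError as err:
    print("put canceled:", err)
ds.reorganize()
print("put after reorganize landed in slot", ds.put("d"))
```

The point of the sketch is only that a reorganize removes the bad bookkeeping along with the deleted rows, without saying anything about how the bookkeeping went bad.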

Now the job runs fine.

*What do you think happened?*

The VLogs, i.e. the VL02001720 and possibly some others nearby in time [esp. just before], plus possibly some from longer ago for the object at the same address, may contain some information relevant to answering that question. What is clear, however, is that the LIC DB was not happy about what transpired at that moment with that dataspace with respect to processing for REUSEDLT(*YES), and the operation was failed\canceled on purpose.

BTW, another job, which only deletes records from this file, was
failing with the same Machine Check error, with only a slightly
different Error Code (hex 1716 instead of 1720, as above)

<<again; rewritten for compactness, symptoms, explanation>>

12/18/14 08:52:28.640853

MCH3203 #dbdelim 00343C QDBUDR 0C1D
12/18/14 08:54:09.273373
Message . . . . : Function error X'1716' in machine instruction.
Internal dump identifier (ID) .

msgMCH3203 F/#dbdelim x/00343C T/QDBUDR x/0C1D
rcx1716 rc1716 vl02001716

Not sure why the DMPID did not appear in the message data; that may itself [and seems to] be a defect.

This time the request was from the OS DB Delete [, Update, Release; i.e. the "UDR" of the program naming] processor, a request to the DB LIC method to Delete a Dataspace Entry [is "DB Delete Image" perhaps the term behind ¿DBdelim?]. Presumably this processing, some ~100 seconds later, encountered what was effectively /the same/ issue with the ReUse feature, but from a different perspective; delete instead of insert. Quite possibly, as alluded for the other failure, the condition would have persisted until circumvented.
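The "~100 seconds" figure comes straight from the two MCH3203 timestamps. A small Python sketch of the arithmetic, assuming the joblog timestamps are in mm/dd/yy format with microseconds [the helper name seconds_between is hypothetical]:

```python
from datetime import datetime

# Assumed joblog timestamp layout: mm/dd/yy hh:mm:ss.microseconds
JOBLOG_FORMAT = "%m/%d/%y %H:%M:%S.%f"

def seconds_between(first: str, second: str) -> float:
    """Return the elapsed seconds from the first timestamp to the second."""
    start = datetime.strptime(first, JOBLOG_FORMAT)
    end = datetime.strptime(second, JOBLOG_FORMAT)
    return (end - start).total_seconds()

elapsed = seconds_between("12/18/14 08:52:28.640853",
                          "12/18/14 08:54:09.273373")
print(f"{elapsed:.3f} seconds between the put and the delete failures")
```

That works out to a little over 100 seconds between the failed put and the failed delete.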

Notice the only difference is the reference to #dbdelim on the
MCH3203 and the hex error code, now '1716' instead of '1720'.

I am now unsure whether I correctly recall the assert\assume minor code for the Function Check [major code 0200] VLog; perhaps that is 1716 rather than 1720, or perhaps both are used in that manner. No matter, this is another failure for which the database I/O processing was terminated.

Both jobs are running fine now, after the reorg.

*Any thoughts?*

Again, the issue that transpired [and presumably persisted from the time of the first logged incident] was circumvented by that action. The origin of the issue, and any underlying issues that might lead to the same failures, may persist; e.g. code defects for which preventive [and\or corrective] PTFs await being installed, or for which the problem remains unreported, such that no PTF(s) could yet have been generated nor the problem report [APAR] answered [perhaps as unreproducible or as ??].

I seem to recall that, some number of releases ago, files could reach a maximum number of deleted rows. I do not recall if there was a corrective fix that did not require RGZPFM, CLRPFM, or re-creating the dataspace, but a design enhancement enabled multiple re-use spaces, allowing a huge increase in the number of deleted records that could be mapped\tracked. Possibly the dataspace [the data portion of the file] associated with the particular member against which the error transpired was created on an old release, and some corrective action remained pending before the member could support the larger number of deleted rows. I do not think that release, or the release of the OS being utilized, was mentioned in the OP. If the Analyze Problem (ANZPRB) as alluded in the messaging were followed, and that led to an issue\PMR being opened with the Service Provider [and IBM Support looked at the collected VLogs], they might be able to determine that the problem was due to the limit imposed on a down-level object, that the limit has since been increased, and that what was used as the circumvention might even have been corrective... or they might determine the issue is a defect [that has or awaits a PTF].