On 18-Dec-2014 16:06 -0600, Thomas Garvey wrote:
We have a program that writes records into an externally defined
database file that has received the following error...
<<SNIPped joblog: rewritten for compactness, symptoms,
and adding some explanation>>
MCH3203 DbReuseFind 000FB4 QDBPUT 06B1
Message . . . . : Function error X'1720' in machine instruction.
Internal dump identifier (ID) 0300167E.
msgMCH3203 f/DbReuseFind x/000FB4 T/QDBPUT x/06B1
rcx1720 rc1720 vl02001720
The LIC DB code that attempted to locate a slot held by a [set of]
deleted record(s) failed; a VLIC Log (VLog) was generated [major code
0200 for "function check" and minor code 1720 for an effective "failed
assumption"] to record the failure. The failed attempt
was a "put", aka a write or insert; the dataspace apparently
supports Reuse Deleted Records (REUSEDLT), and thus the LIC DB was
looking for a place to insert the record(s) into a slot held by a
since-deleted row. Something about this search went awry; the error
is likely an assertion that something should have been true but was
found not to be [i.e. a failed assumption], so rather than
progressing in spite of the bad information [which could have led to
GIGO], the operation was canceled to protect the integrity of the data.
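As a rough illustration of how the symptom keywords relate to the message text, here is a minimal Python sketch. It assumes, per the explanation above, that 0200 is the "function check" major code and that the four hex digits in the MCH3203 text are the minor code; the function name and the regular expression are my own, not anything provided by the OS.

```python
import re

# Assumption (from the explanation above): 0200 is the "function check" major code.
FUNCTION_CHECK_MAJOR = "0200"

def symptom_keywords(message_text, major=FUNCTION_CHECK_MAJOR):
    """Build the rcx/rc/vl symptom keywords from MCH3203 message text."""
    match = re.search(r"Function error X'([0-9A-Fa-f]{4})'", message_text)
    if match is None:
        raise ValueError("no function error code found in message text")
    minor = match.group(1).upper()
    # The VLog keyword is the major code followed by the minor code.
    return ["rcx" + minor, "rc" + minor, "vl" + major + minor]

print(symptom_keywords("Function error X'1720' in machine instruction."))
```

The resulting strings are only search keywords of the kind shown in the symptom line; they are not something the system itself produces in this form.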
CPF9999 QMHUNMSG *N QDBPUT 06B1
Message . . . . : Function check. MCH3203 unmonitored by QDBPUT
at statement *N, instruction X'06B1'.
msgCPF9999 F/QMHUNMSG t/QDBPUT x/06B1
The LIC DB module that implements the /insert-row/ method was invoked
by the OS DB program QDBPUT; at that fix-level of QDBPUT, the invocation
occurs at the failing instruction x/06B1 [for whatever specific type of
write activity, according to the code path].
The generic "function check" condition, aka the *FC, indicates that
the OS Database "Put" program did not monitor for the failure... which
is correct; there is really nothing the DB2 Insert feature can do when
the underlying database support fails, so the message is manifest as an
"unmonitored failure", which also enables some internal diagnostic
messaging support to better record the failure.
CPF3698 QSCPUTR 005E QMHAPD 0500
Message . . . . : Dump output directed to spooled file 1,
job 875639/JOBNAME/E206151053 created on system ITEAM on...
msgCPF3698 F/QSCPUTR x/005E T/QMHAPD x/0500
As part of Software Logging [SC: ¿Service Component?] for software
errors, the prior /unmonitored/ condition is further logged as a
"software problem". Really not germane, just a record that the OS was
told to log, so that was done, and the effect was logged in the joblog.
CPF5257 QDBSIGEX 0539 E20615105R WORKLIB STMT/2123
Message . . . . : Failure for device or member E206151053 file
E206151053 in library WORKLIB.
Cause . . : An error occurred during a read or write operation.
msgCPF5257 F/QDBSIGEX x/0539 t/usrpgm
The user program issuing the failing statement, a WRITE, INSERT,
whatever, is: E20615105R in WORKLIB STMT/2123
The requesting program must be informed of the failed I/O. The
I/O error is signaled by the generic "Signal Exception" routine of the
OS Database component, using the Common Data Management [DM] range of
messages, because the I/O was via the Open Data Path (ODP)
created for the Database Open request; the member, file, and library
are offered up as part of the information, and that the operation was a
"write" vs a read can be inferred from the OS DB program name.
I examined the file and could not determine if it was damaged. Then
I noticed the 'DbReuseFind' reference (in the original MCH3203
message) which made me check the maximum file size, and number of
deleted records, as I know the file is configured to 'reuse deleted
records'. The file had some 2,500,000 records, and slightly more than
1,000,000 were deleted records, and the file has *NOMAX for Member
Size. So, I reorganized the file, just to see what would happen. It
reorganized just fine, eliminated the deleted records, and I
restarted the job.
The Reorganize Database File Member (RGZPFM) request will purge the
deleted records, and thus purge the reuse-table; i.e. there will be no
table in which deleted rows are tracked, thus the database insert\put
will have no reason to search for a deleted row to replace with an
active row.
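To make that reasoning concrete, here is a toy Python model of a member with a reuse-table. Everything in it (the class name, the list-based table, the check in insert) is a hypothetical simplification for illustration only; it is not how the LIC DB actually tracks deleted rows.

```python
class ToyMember:
    """Toy stand-in for a member's dataspace; NOT the LIC implementation."""

    def __init__(self):
        self.slots = []        # one entry per record slot; None marks a deleted row
        self.reuse_table = []  # hypothetical list of slot indexes holding deleted rows

    def insert(self, row):
        if self.reuse_table:
            index = self.reuse_table.pop()
            # Analogue of the "failed assumption": a tracked slot must be deleted.
            if self.slots[index] is not None:
                raise RuntimeError("reuse table points at an active row")
            self.slots[index] = row
            return index
        # Nothing tracked as deleted, so simply append at the end.
        self.slots.append(row)
        return len(self.slots) - 1

    def delete(self, index):
        if self.slots[index] is None:
            raise RuntimeError("row already deleted")
        self.slots[index] = None
        self.reuse_table.append(index)

    def reorganize(self):
        # Compress out the deleted rows and discard the tracking table.
        self.slots = [row for row in self.slots if row is not None]
        self.reuse_table = []
```

In this model, after reorganize() the table is empty, so the next insert takes the append path and never consults any tracking information that might have been inconsistent, which is the sense in which the reorganize sidesteps the failing search.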
Thus, the problem was "circumvented"; the origin of the problem is
not [yet] investigated, and the details of the condition only exist in
the logging, because the object [the dataspace] against which the
failure transpired, no longer exhibits the problem.
Now the job runs fine.
*What do you think happened?*
The VLogs, the VL02001720 and possibly some logged nearby in time
[esp. just before], plus possibly some older ones logged for the object
at the same address, may contain some information relevant to answering
that question. What is clear however, is that the LIC DB was not happy about
what transpired at that moment with that dataspace with respect to
processing for the REUSEDLT(*YES), and the operation was failed\canceled.
BTW, another job, which only deletes records from this file, was
failing with the same Machine Check error, with only a slightly
different Error Code (hex 1716 instead of 1720, as above)
<<again; rewritten for compactness, symptoms, explanation>>
MCH3203 #dbdelim 00343C QDBUDR 0C1D
Message . . . . : Function error X'1716' in machine instruction.
Internal dump identifier (ID) .
msgMCH3203 F/#dbdelim x/00343C T/QDBUDR x/0C1D
rcx1716 rc1716 vl02001716
Not sure why the DMPID did not appear in the message data; that may
itself [and seems to] be a defect.
This time the request was from the OS DB Delete [, Update, Release;
i.e. the "UDR" of the program naming] processor, a request to the DB LIC
method to Delete a Dataspace Entry [¿is "DB Delete Image" the meaning
of DBdelim?]. Presumably this processing, some ~100 seconds later,
encountered what was effectively /the same/ issue with the ReUse
feature, but from a different perspective; delete instead of insert.
Quite possibly, like alluded for the other failure, the condition would
Notice the only difference is the reference to #dbdelim on the
MCH3203 and the hex error code, now '1716' instead of '1720'.
I am now unsure if I recall the assert\assume minor code for the
Function Check [major code 0200] Vlog; perhaps that is 1716 vs 1720, or
perhaps both may be used in that manner. No matter, this is another
failure for which the database I/O processing was terminated.
Both jobs are running fine now, after the reorg.
Again, the issue that had transpired [and presumably persisted from the
time of the first logged incident] was circumvented by that action.
The origin of the issue, and any underlying issues that might lead to
the same failures, may persist; e.g. code defects for which the
preventive [and\or corrective] PTFs await being installed, or defects
that remain unreported, such that no PTF(s) could yet be generated nor
the problem report [APAR] answered [perhaps as unreproducible or as ??].
I seem to recall that, some number of releases ago, files could reach
a maximum number of deleted rows. I do not recall if there was a
corrective fix that did not require RGZPFM, CLRPFM, or re-create of the
dataspace, but a design enhancement enabled multiple re-use spaces to
allow for a huge increase in the number of deleted records that can be
mapped\tracked. Possibly the dataspace [the data portion of the file]
associated with the particular member against which the error
transpired was created on an old release, and some corrective action
remained pending before the member could support the larger number of
deleted rows. I do not think that release, or the release of the OS
being utilized, was mentioned in the OP. If the Analyze Problem (ANZPRB)
as alluded in the messaging were followed, and that led to an issue\PMR
being opened with the Service Provider [and IBM Support looked at the
collected VLogs], they might be able to determine that the problem was
due to a limit imposed on a down-level object, that the limit has
since been increased, and that what was used as the circumvention might
even have been corrective... or they might determine the issue is a
defect [that has or awaits a PTF].