Below is an update from IBM.
For the archives.
I've staggered my DUPMEDBRM start times by 5 minutes, which should avoid the issue. The underlying problem is probably not fixable.
- When an operation needing a tape resource runs and needs a tape drive,
the tape code will attempt to reserve a tape drive. If it cannot get a
reservation on its first choice it will try again to reserve another
drive. It will do this repeatedly until it gets a reservation or until
all drive resources have been tried. Assuming a drive is found that is
available, a 'reservation' is put on the tape drive resource. This
reservation exists on the tape drive itself and is assigned to a
specific adapter port by World Wide Port Name (WWPN). This reservation
will be persistent until the host releases the reservation. If anything
stops the host from releasing the reservation, the drive will remain
reserved and can only be used by that specific adapter port.
- Once a drive is reserved and a tape needs to be mounted, tape code
will check to see if the tape is already mounted in another drive. If
the tape is mounted in another drive, the system running the operation
needing a tape resource will attempt to reserve the tape drive that the
tape is mounted in. If it gets the reservation, for a short period of
time the job will have caused two drives to be reserved. The first
drive will be released shortly; however, if multiple jobs that need
drives are starting at the same time, it is possible that each job
could reserve multiple drives for short periods of time and some job
may not be able to get a drive. Operations requiring more than one tape
resource, such as DUPMEDBRM or DUPTAP, can magnify this issue. Slightly
staggering job start times can reduce this possibility.
- Different types of errors (user, job, device) can cause a reservation
to be left hanging on a drive. Once this happens, only the system using
the adapter port with the proper WWPN will be able to use the tape drive
or release the reservation.
- Although it may not be common, changes to the fabric can make it
impossible for a system to release a reservation. For example:
Job Runs and TAP01 is reserved
Job ends abnormally and does not release reservation
Fibre cable is moved to a different adapter port
At this point there is no way any of the host systems can release this
reservation...
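
To illustrate the contention IBM describes above, here is a toy model in Python. It is not IBM's tape code: the function name failed_jobs, the 20-second double-reservation window and the one-hour run time are all made-up assumptions, and "the drive the tape is already mounted in" is simply modeled as the next free drive. It only shows why jobs that start in the same instant can run out of drives while staggered jobs do not.

```python
def failed_jobs(start_times, n_drives, overlap=20, run_time=3600):
    """Count jobs that find no free drive (toy model, times in seconds).

    Assumptions (hypothetical, for illustration only):
    - each job reserves the first free drive it finds;
    - if another drive is free, the job also reserves it (standing in for
      "the drive the tape is already mounted in") and keeps that one for
      run_time, releasing its first drive after `overlap` seconds;
    - a job that finds no free drive at its start time fails.
    """
    free_at = [0] * n_drives  # time at which each drive's reservation ends
    failures = 0
    for start in sorted(start_times):
        free = [i for i, t in enumerate(free_at) if t <= start]
        if not free:
            failures += 1
            continue
        first = free[0]
        if len(free) > 1:
            second = free[1]
            free_at[first] = start + overlap    # released shortly
            free_at[second] = start + run_time  # drive the job really uses
        else:
            free_at[first] = start + run_time
    return failures


# Three jobs, four drives, all starting in the same second:
print(failed_jobs([0, 0, 0], n_drives=4))      # 1
# Same jobs staggered by 5 minutes:
print(failed_jobs([0, 300, 600], n_drives=4))  # 0
```

In this sketch the first two jobs briefly hold two drives each, so the third job finds nothing free even though four drives would normally be enough for three jobs. With a 5 minute stagger the short-lived first reservations have already been released by the time the next job starts.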
=============== KMM Information ================
Date: 15/04/16
Time: 1302
Description: CPS Discussion Item
Action: Used
Contributed: Yes
Content Source: CPS
Content: 9UNPUH
================================================
Paul
-----Original Message-----
From: MIDRANGE-L [mailto:midrange-l-bounces@xxxxxxxxxxxx] On Behalf Of Steinmetz, Paul
Sent: Thursday, April 02, 2015 2:54 PM
To: 'Midrange Systems Technical Discussion'
Subject: RE: CPP6316 - Hardware configuration change detected, followed by CPF414C - Command not allowed
IBM support and development both agree that this is a timing issue.
It's difficult to obtain the required Tape Flight Recorders, which development needs to research and possibly resolve the issue.
It was suggested to change or lengthen the amount of time BRMS is waiting for TAPMLB01.
By default, the TAPMLB01 device description wait times are set to *JOB.
Initial mount wait time . . . . . : *JOB
End of volume mount wait time . . : *JOB
I don't want to change the TAPMLB01 device description, because this setting could get lost when the device is recreated.
The BRMS DUPMEDBRM job wait time is derived from the class for the subsystem in which the job is running, which is currently 30 seconds.
Default wait time in seconds . . . . . . . . . . : 30
Increasing the wait time for this class could impact other processes.
How else could I increase the BRMS DUPMEDBRM default wait time of 30 seconds, to say 60 or 120?
Checking all my classes, most are set to 30.
I did find class QBATCH, set to 120, but I don't think it is being used.
Decades ago, we created our own versions of QINTER and QBATCH.
I'm not sure why, but all our batch subsystems are running at 30, not 120.
A routing entry for the subsystem determines which Class is used.
I could create a new Class for BRMS only and add a new routing entry to the subsystem, so that only BRMS jobs would use the Class with the longer default wait time.
What are others class default wait times set to?
Your thoughts?
WRKCLS CLS(*ALL/*ALL)
Paul
-----Original Message-----
From: MIDRANGE-L [mailto:midrange-l-bounces@xxxxxxxxxxxx] On Behalf Of rob@xxxxxxxxx
Sent: Wednesday, April 01, 2015 4:01 PM
To: Midrange Systems Technical Discussion
Subject: RE: CPP6316 - Hardware configuration change detected, followed by CPF414C - Command not allowed
In the tape library there is an operation (via the web interface) to download some tape logs. Normally this is something that I collect and ship off to IBM rather than look at myself. IDK if you can look through these and see something like: too many concurrent tape drive operations.
But, seriously, auto tape cleaning was blowing up our backups. It would actually stop during the middle of the backup, eject the backup tape, and do a cleaning. It would be nice if it could do this and then continue on with the backup but that was NOT our experience. There's a difference between "hey a cleaning would be nice around now" and "cough, gag, I'm aborting the backup". Sort of like the difference between a warning and a hard halt.
I know you said that the previous time cleaning was not an issue, but perhaps there was a resource conflict due to too many simultaneous backups.
Rob Berendt
--
IBM Certified System Administrator - IBM i 6.1
Group Dekko
Dept 1600
Mail to: 2505 Dekko Drive
         Garrett, IN 46738
Ship to: Dock 108
         6928N 400E
         Kendallville, IN 46755
http://www.dekko.com
From: "Steinmetz, Paul" <PSteinmetz@xxxxxxxxxx>
To: "'Midrange Systems Technical Discussion'"
<midrange-l@xxxxxxxxxxxx>
Date: 04/01/2015 03:40 PM
Subject: RE: CPP6316 - Hardware configuration change detected,
followed by CPF414C - Command not allowed
Sent by: "MIDRANGE-L" <midrange-l-bounces@xxxxxxxxxxxx>
Rob,
Auto clean works like a champ, once it was properly configured: proper slot, correct bar code label, etc.
The Jan failure did not include an auto clean, but it was still the same failure.
Paul
-----Original Message-----
From: MIDRANGE-L [mailto:midrange-l-bounces@xxxxxxxxxxxx] On Behalf Of rob@xxxxxxxxx
Sent: Wednesday, April 01, 2015 3:22 PM
To: Midrange Systems Technical Discussion
Subject: RE: CPP6316 - Hardware configuration change detected, followed by CPF414C - Command not allowed
CPP6316 suggests doing an IOP reset because a configuration change may have occurred without doing an IOP reset. Or it just may be a hardware issue.
We do not autoclean. And I think that was because of issues like this. We just do a manual clean every so many weeks.
You could add more tape drives. We have 10 in one library and four in another.
Rob Berendt
From: "Steinmetz, Paul" <PSteinmetz@xxxxxxxxxx>
To: "'Midrange Systems Technical Discussion'"
<midrange-l@xxxxxxxxxxxx>
Date: 04/01/2015 03:07 PM
Subject: RE: CPP6316 - Hardware configuration change detected,
followed by CPF414C - Command not allowed
Sent by: "MIDRANGE-L" <midrange-l-bounces@xxxxxxxxxxxx>
Rob,
I had a similar issue once when a drive was replaced and then the new
drive didn't report in properly.
This is not the case.
I think this is a tape library timing issue, where, if the library does
not complete its request in the allotted time, a failure will occur.
In this case, I think a request was made to mount a volume, the library
was busy with other processing, so the mount did not occur in the allotted
time, thus a failure occurred.
Your thoughts? How, and to whom, should this be reported?
Paul
-----Original Message-----
From: MIDRANGE-L [mailto:midrange-l-bounces@xxxxxxxxxxxx] On Behalf Of
rob@xxxxxxxxx
Sent: Wednesday, April 01, 2015 2:13 PM
To: Midrange Systems Technical Discussion
Subject: RE: CPP6316 - Hardware configuration change detected, followed by
CPF414C - Command not allowed
I just replaced one of the LTO4HH drives in one of our libraries. It would
often drop its connection well into a save.
When they replaced the drive I had some 'fun'. The serial number of the
drive (not the library) changed.
The drive is controlled by VIOS. That VIOS lpar does the fiber tape drive.
- WRKMLBSTS and vary off for all lpars of IBM i.
- WRKCFGSTS *DEV TAP* and vary off for all lpars of IBM i.
- Replace the drive.
- Bounce the VIOS lpar. (Probably could have done an 'iop reset' if I knew
VIOS better.)
- WRKHDWRSC *STG
- http://mytaplib, record serial number of drives.
- STRSST, delete old resource for replaced drive. Rename resource of new
drive to that of replaced drive. I'm OCD on this as I like the resource
names to match the drive names. And I like them to be consistent across
all lpars.
- WRKMLBSTS, ensure that it is still varied off.
- WRKCFGSTS *DEV TAP* on just one lpar. Vary on replaced tape drive (not
library). Use 'itdt' to stress test drive. Passed. Vary off TAP* devices.
- WRKMLBSTS and vary it on.
- Make sure that only the replaced drive is 'allocated unprotected'. Do a
long extensive BRMS save. When complete, expire the tape.
- WRKMLBSTS for all lpars and make sure library is ready and drives are all
allocated unprotected.
Rob Berendt
--
IBM Certified System Administrator - IBM i 6.1 Group Dekko Dept 1600 Mail
to: 2505 Dekko Drive
Garrett, IN 46738
Ship to: Dock 108
6928N 400E
Kendallville, IN 46755
http://www.dekko.com
From: "Steinmetz, Paul" <PSteinmetz@xxxxxxxxxx>
To: "'Midrange Systems Technical Discussion'"
<midrange-l@xxxxxxxxxxxx>
Date: 04/01/2015 01:54 PM
Subject: RE: CPP6316 - Hardware configuration change detected,
followed by CPF414C - Command not allowed
Sent by: "MIDRANGE-L" <midrange-l-bounces@xxxxxxxxxxxx>
I had a repeat of this issue/failure.
We had the exact same series of messages, (from our 2/1 failure) along
with a DUPMEDBRM failure, different LPAR.
There was also a drive auto clean request issued 2:31 am, may or may not
be related.
So the library was busy moving volumes for 4 drives, 2 lpars, between 2:30
and 2:32 am.
CPP6316 - Hardware configuration change detected.
BRM4138 - Media duplication completed with errors.
CPF414C - Command not allowed
I think the issue is related to month end tape processing.
Both LPARs were in the process of starting a DUPMEDBRM within the same
minute, 2:30 am.
Unfortunately no drive dumps obtained.
                         Serial        Resource
Name        Type  Model  Number        Name
TAPMLB01    3573  040    04-7808387    TAPMLB01
Log ID . . . . . . . . . : 800C83F7 Sequence . . . . . . . : 1641240
Date . . . . . . . . . . : 04/01/15 Time . . . . . . . . . : 02:32:03
Reference code . . . . . : 9220 Secondary code . . . . : 00000000
Table ID . . . . . . . . : 94290310 IPL source/state . . . : B/3
System Ref Code . . . . . : 94299220
Server of origin . . . . : 8205-E6C 10-5815R
Class . . . . . . . . . . : Permanent
Hardware configuration change detected.
Below is IBM's PMR explanation from the 2/1 failure.
"The issue is that the library reported it was offline in the middle of a
function. The question is why. There is nothing from the IBM i that would
account for this.
The User command was running, we were processing a mount function, the
library was taking commands and sending data, we were able to identify
the location of the desired tape, we had the tape device, but when we
issued the move command the library returned offline. The library for some
reason decided to respond offline, which then leads to the PAL and the
CPF414C.
We know the PAL is not the correct one. The SK 2/0412 error was mapped to
a hardware configure error and it should be mapped to a library state
change. But in either case the end result would be the same... CPF414C
library not in library mode. We have this PAL change documented.
So if you or your staff can not account for a reason why the library would
have been offline, then we would need the library and drive logs
collected at the time of the failure to provide why the library responded
the way it did.
If this issue can be reproduced we will need the following collected at
the time of failure....
Call QTADMPDV TAPMLBxx
Library and drive logs captured from the Library GUI Service functions
(Tape Support would assist with the operation if needed)"
Anyone experience something similar?
Any thoughts on which support to contact?
Paul
-----Original Message-----
From: Steinmetz, Paul
Sent: Monday, February 02, 2015 1:26 PM
To: 'Midrange Systems Technical Discussion'
Subject: CPP6316 - Hardware configuration change detected, followed by
CPF414C - Command not allowed
I had this error occur over the weekend, during a DUPMEDBRM.
The dup failed, and so did the recovery.
TAPMLB01 then recovered with no operator intervention.
The following DUPMEDBRM runs and saves were successful.
Has anyone ever experienced this issue?
Message ID . . . . . . : CPP6316
Date sent . . . . . . : 02/01/15 Time sent . . . . . . : 06:10:20
Message . . . . : Hardware configuration change detected.
Cause . . . . . : Device *N has reported an error that indicates that the
configuration may have been changed without the IOP being reset. There may
also be a hardware problem with the device.
Recovery . . . :
Reset the IOP. If the problem continues press F14 to work with the
problem.
Technical description . . . . . . . . :
IOP resource . . . . . . . . . . . . : CMB03
IOA resource . . . . . . . . . . . . : DC02
Device type . . . . . . . . . . . . . : 3573
Reference code . . . . . . . . . . . : X'9220'
Error log ID . . . . . . . . . . . . : X'8021F782'
Problem log ID . . . . . . . . . . . : 1503221191
Message ID . . . . . . : CPF414C
Date sent . . . . . . : 02/01/15 Time sent . . . . . . : 06:10:21
Message . . . . : Command not allowed
Cause . . . . . : Library device TAPMLB01 is not in library mode.
Recovery . . . : Switch library device TAPMLB01 to library mode and retry
the operation.
Thank You
_____
Paul Steinmetz
IBM i Systems Administrator
Pencor Services, Inc.
462 Delaware Ave
Palmerton Pa 18071
610-826-9117 work
610-826-9188 fax
610-349-0913 cell
610-377-6012 home
psteinmetz@xxxxxxxxxx
http://www.pencor.com/