Rob,
Yes, I thought of possibly using a WRKWCH for PAL and/or LIC entries.
However, a watch would not be running during an IPL, so those entries would probably be missed.
The PAL entries are stored in QUSRSYS/QASXPROB.
I thought of maybe checking this PF with "something".
Paul
-----Original Message-----
From: MIDRANGE-L [mailto:midrange-l-bounces@xxxxxxxxxxxxxxxxxx] On Behalf Of Rob Berendt
Sent: Tuesday, January 22, 2019 8:29 AM
To: Midrange Systems Technical Discussion
Subject: RE: 57B5 (5913) card/pair failure/recovery - difficult to monitor and troubleshoot
I don't know of any current service that will query the PAL or SAL, but you could submit an RFE.
https://www.ibm.com/support/knowledgecenter/ssw_ibm_i_73/apis/qscswch.htm
The Start Watch (QSCSWCH) API starts the watch for event function, which notifies the user by calling a user specified program when the specified event (a message, a LIC log or a PAL) occurs. PAL stands for Product Activity Log which shows errors that have occurred (such as in disk and tape units, communications, and work stations).
Occasionally IBM allows some of the service stuff to be accessed outside of SST also. Starting, stopping and managing the data from comm traces comes to mind.
-----Original Message-----
From: MIDRANGE-L <midrange-l-bounces@xxxxxxxxxxxxxxxxxx> On Behalf Of Steinmetz, Paul
Sent: Tuesday, January 22, 2019 8:04 AM
To: 'Midrange Systems Technical Discussion' <midrange-l@xxxxxxxxxxxxxxxxxx>
Subject: RE: 57B5 (5913) card/pair failure/recovery - difficult to monitor and troubleshoot
Nothing in QHST.
One initial PAL entry.
One SAL entry with count of 3.
Can either the PAL or SAL be monitored?
Paul
-----Original Message-----
From: MIDRANGE-L [mailto:midrange-l-bounces@xxxxxxxxxxxxxxxxxx] On Behalf Of Rob Berendt
Sent: Tuesday, January 22, 2019 7:54 AM
To: Midrange Systems Technical Discussion
Subject: RE: 57B5 (5913) card/pair failure/recovery - difficult to monitor and troubleshoot
Were there any matching records in QHST during that time? If so, this is quite easy to monitor, even if you have to periodically query QHST (using the appropriate service API) and feed that into your monitoring tool.
Have you asked your tool vendor if they "catch up" on QHST after an IPL? It's worth the time to question them.
-----Original Message-----
From: MIDRANGE-L <midrange-l-bounces@xxxxxxxxxxxxxxxxxx> On Behalf Of Steinmetz, Paul
Sent: Monday, January 21, 2019 4:27 PM
To: 'Midrange Systems Technical Discussion' <midrange-l@xxxxxxxxxxxxxxxxxx>
Subject: RE: 57B5 (5913) card/pair failure/recovery - difficult to monitor and troubleshoot
SST Log analysis shows the below entry during each IPL.
Most monitoring tools can't/don't access PAL, SAL, LIC Log etc.
Even if they could, the entries would be missed because the tools are not active during an IPL.
Currently, I can't monitor during an IPL and I can't monitor for any of these.
System                               Resource  Resource
Ref Code  Date      Time      Class  Name      Type
57B59076  01/21/19  01:10:38  Perm   DC05      57B5
B6005090  01/21/19  01:11:39  Qual   DMP048    19B3
57B59076  01/21/19  03:23:58  Perm   DC05      57B5
B6005090  01/21/19  03:24:59  Qual   DMP048    19B3
57B59076  01/21/19  04:05:47  Perm   DC05      57B5
B6005090  01/21/19  04:06:52  Qual   DMP048    19B3
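If the listing above can be captured as text after each IPL, a small script could flag reference codes that keep recurring. The following is only a sketch in Python: how the text gets out of SST is not covered and is an assumption, and the function names and the threshold of 2 are made up for illustration.

```python
from collections import Counter
from datetime import datetime

# Rows copied from the log analysis listing in this message.
# Capturing this text from SST automatically is assumed, not shown.
LISTING = """\
57B59076 01/21/19 01:10:38 Perm DC05 57B5
B6005090 01/21/19 01:11:39 Qual DMP048 19B3
57B59076 01/21/19 03:23:58 Perm DC05 57B5
B6005090 01/21/19 03:24:59 Qual DMP048 19B3
57B59076 01/21/19 04:05:47 Perm DC05 57B5
B6005090 01/21/19 04:06:52 Qual DMP048 19B3
"""


def parse_entries(text):
    """Turn each six-field line into a dict; skip headers and odd lines."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 6:
            continue
        src, date, time, cls, resource, rtype = parts
        try:
            when = datetime.strptime(f"{date} {time}", "%m/%d/%y %H:%M:%S")
        except ValueError:
            continue  # e.g. the "Ref Code Date Time ..." header line
        entries.append({"src": src, "when": when, "class": cls,
                        "resource": resource, "type": rtype})
    return entries


def repeated_perm_errors(entries, threshold=2):
    """Count 'Perm' entries per (SRC, resource); keep those seen often."""
    counts = Counter((e["src"], e["resource"])
                     for e in entries if e["class"] == "Perm")
    return {key: n for key, n in counts.items() if n >= threshold}


entries = parse_entries(LISTING)
repeats = repeated_perm_errors(entries)
print(repeats)
```

With the six rows shown, only the 57B59076 entries on DC05 are classed as Perm, so that is the pair that would be reported.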
Paul
-----Original Message-----
From: MIDRANGE-L [mailto:midrange-l-bounces@xxxxxxxxxxxxxxxxxx] On Behalf Of Rob Berendt
Sent: Monday, January 21, 2019 4:17 PM
To: Midrange Systems Technical Discussion
Subject: RE: 57B5 (5913) card/pair failure/recovery - difficult to monitor and troubleshoot
When you did the additional IPLs, did you select the option on PWRDWNSYS to check hardware?
We only IPL our hosting LPAR once a quarter. Part of the checklist is to ensure that WRKDSKSTS does not show performance degraded.
Reminder: Frequent IPLs delete SQL performance data and can adversely affect performance.
You could use the DB2 service APIs to query QHST, via QSYS2.HISTORY_LOG_INFO(), for the time around the first IPL and see whether that message shows up. Then try to figure out how to shoehorn that into your monitoring. Startup query perhaps?
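As a rough illustration of the startup query idea, the Python sketch below only builds the SQL text for QSYS2.HISTORY_LOG_INFO() around a given IPL time. The function name, the window sizes and the severity cutoff of 40 are assumptions chosen for the example, and actually running the statement needs a database connection that is not shown here.

```python
from datetime import datetime, timedelta


def history_log_query(ipl_time, minutes_before=10, minutes_after=30,
                      min_severity=40):
    """Build a QHST query for a window around ipl_time.

    The window and severity defaults are arbitrary example values.
    """
    start = ipl_time - timedelta(minutes=minutes_before)
    end = ipl_time + timedelta(minutes=minutes_after)
    fmt = "%Y-%m-%d-%H.%M.%S"  # IBM SQL timestamp string format
    return (
        "SELECT MESSAGE_TIMESTAMP, MESSAGE_ID, SEVERITY, MESSAGE_TEXT "
        "FROM TABLE(QSYS2.HISTORY_LOG_INFO("
        f"START_TIME => TIMESTAMP('{start.strftime(fmt)}'), "
        f"END_TIME => TIMESTAMP('{end.strftime(fmt)}'))) AS H "
        f"WHERE SEVERITY >= {int(min_severity)} "
        "ORDER BY MESSAGE_TIMESTAMP"
    )


# Example: the time of the first logged error on 01/21/19.
query = history_log_query(datetime(2019, 1, 21, 1, 10, 38))
print(query)
```

A job submitted from the startup program could run a statement like this once and hand any rows to the monitoring tool, which would cover messages logged while the tool was not yet active.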
The storage service APIs do not seem to cover the performance degraded status. I wonder if the System Health Services do?
-----Original Message-----
From: MIDRANGE-L <midrange-l-bounces@xxxxxxxxxxxxxxxxxx> On Behalf Of Steinmetz, Paul
Sent: Monday, January 21, 2019 3:42 PM
To: 'Midrange Systems Technical Discussion' <midrange-l@xxxxxxxxxxxxxxxxxx>
Subject: 57B5 (5913) card/pair failure/recovery - difficult to monitor and troubleshoot
Early this morning, I had a 57B5 (5913) pair fail during an IPL.
Because our monitoring software is not running during the IPL, I initially did not see the alert.
After two additional IPLs the LPAR continued to run, but performance was EXTREMELY poor. The initial "call home" PAL entry did not reappear; only the count in the SAL entry increased, which I didn't see until later on.
SAL entry
Status  Date      Time      SRC       Resource  Count  PLID
NEW     01/21/19  01:10:38  57B59076  DC05      3
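Since the only sign of the repeat failures was the rising SAL count, one option is to compare the count between checks. The Python sketch below assumes the counts have somehow been read into a dictionary keyed by SRC and resource; that retrieval step, the function name and the earlier count of 1 are all assumptions for illustration (only the final count of 3 comes from the entry above).

```python
def count_increases(previous, current):
    """Return keys whose count grew, or that are new, since the last check."""
    changes = {}
    for key, count in current.items():
        before = previous.get(key, 0)
        if count > before:
            changes[key] = (before, count)
    return changes


# Hypothetical snapshot from an earlier check (the value 1 is assumed).
previous = {("57B59076", "DC05"): 1}
# Snapshot matching the SAL entry above.
current = {("57B59076", "DC05"): 3}

print(count_increases(previous, current))
```

The previous snapshot would have to be saved somewhere that survives an IPL, so that the first check after restart can still see the increase.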
It was later discovered, with IBM hardware support, via SST (6. Display disk hardware status), that the pair was "Performance degraded", which implied the disk controllers were running with ZERO cache.
Performance degraded -
This state indicates the device is functional but performance may be impacted due to other hardware problems (such as an IOA cache problem).
We identified the "suspect" card, which was NOT operational.
Powered off the slot, powered slot back on.
LPAR disk performance problem resolved.
I have had similar failures on a different LPAR, with a different card pair, over the years.
Those failures were not during an IPL, but while the LPAR was running.
The difference in those two previous failures was that the card/slot was automatically reset by the code.
Previously, the error was an L2 cache error and the cards needed to do a reset for data integrity reasons.
The controllers went into a recovery that lasted 23 seconds, and the LPAR was "suspended" during this period.
During the recovery, several applications failed, which then needed a manual reset/recycle.
1) How does one better monitor for these types of card/pair failures?
2) Why did the 2nd and 3rd IPLs not reset the card?
3) Why did the 2nd and 3rd IPLs not "call home" and create a new PAL entry?
4) Has anyone else in the group experienced similar card/pair failures?
Thank You
_____
Paul Steinmetz
IBM i Systems Administrator
Pencor Services, Inc.
462 Delaware Ave
Palmerton Pa 18071
610-826-9117 work
610-826-9188 fax
610-349-0913 cell
610-377-6012 home
psteinmetz@xxxxxxxxxx
http://www.pencor.com/
--
This is the Midrange Systems Technical Discussion (MIDRANGE-L) mailing list To post a message email: MIDRANGE-L@xxxxxxxxxxxxxxxxxx To subscribe, unsubscribe, or change list options,
visit:
https://lists.midrange.com/mailman/listinfo/midrange-l
or email: MIDRANGE-L-request@xxxxxxxxxxxxxxxxxx
Before posting, please take a moment to review the archives at
https://archive.midrange.com/midrange-l.
Please contact support@xxxxxxxxxxxx for any subscription related questions.
Help support midrange.com by shopping at amazon.com with our affiliate link:
https://amazon.midrange.com