RE: 57B5 (5913) card/pair failure/recovery - difficult to monitor and troubleshoot -- MIDRANGE-L

When you did the additional IPLs, did you select the option on PWRDWNSYS to check hardware?
We only IPL our hosting LPAR once a quarter. Part of the checklist is to ensure that WRKDSKSTS does not show "Performance degraded".

Reminder: Frequent IPLs delete SQL performance data and can adversely affect performance.

You could use the DB2 service APIs to query QHST, through QSYS2.HISTORY_LOG_INFO(), for the period around the time of the first IPL and see whether that message shows up. Then try to figure out how to shoehorn that into your monitoring. A startup query, perhaps?
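
As a rough sketch of that QHST lookup, the Python snippet below only builds the SQL text for a time window around an IPL; it does not connect to the system. The helper name history_log_query, the 30-minute window, and the severity filter are my own illustrative assumptions, and the example timestamp is simply the SAL entry time from Paul's note. You would still need to work out which message ID or text to filter on, and then run the statement through whatever SQL interface you already use (ACS Run SQL Scripts, RUNSQLSTM, ODBC, and so on).

```python
from datetime import datetime, timedelta


def history_log_query(ipl_time, window_minutes=30, min_severity=0):
    """Build SQL that lists QHST messages within +/- window_minutes of ipl_time.

    Hypothetical helper: it only assembles the statement text.
    """
    window = timedelta(minutes=window_minutes)
    # Db2 for i native timestamp string format: yyyy-mm-dd-hh.mm.ss
    fmt = "%Y-%m-%d-%H.%M.%S"
    start = (ipl_time - window).strftime(fmt)
    end = (ipl_time + window).strftime(fmt)
    return (
        "SELECT MESSAGE_TIMESTAMP, MESSAGE_ID, SEVERITY, FROM_JOB, MESSAGE_TEXT\n"
        "  FROM TABLE(QSYS2.HISTORY_LOG_INFO(\n"
        f"         START_TIME => TIMESTAMP('{start}'),\n"
        f"         END_TIME   => TIMESTAMP('{end}')))\n"
        f" WHERE SEVERITY >= {int(min_severity)}\n"
        " ORDER BY MESSAGE_TIMESTAMP"
    )


if __name__ == "__main__":
    # Example time taken from the SAL entry (01/21/19 01:10:38); adjust to the real IPL time.
    print(history_log_query(datetime(2019, 1, 21, 1, 10, 38)))
```

Once you know which message to look for, an extra predicate on MESSAGE_ID or MESSAGE_TEXT would narrow the result enough to schedule it as a startup check.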

The storage service APIs do not seem to cover the performance degraded status. I wonder if the System Health Services do?

-----Original Message-----
From: MIDRANGE-L <midrange-l-bounces@xxxxxxxxxxxxxxxxxx> On Behalf Of Steinmetz, Paul
Sent: Monday, January 21, 2019 3:42 PM
To: 'Midrange Systems Technical Discussion' <midrange-l@xxxxxxxxxxxxxxxxxx>
Subject: 57B5 (5913) card/pair failure/recovery - difficult to monitor and troubleshoot

Early this morning, I had a 57B5 (5913) pair fail during an IPL.
Because our monitoring software is not running during the IPL, I initially did not see the alert.
Two additional IPLs followed. The LPAR continued to run, but performance was EXTREMELY poor. The initial "call home" PAL entry did not reappear; there was only an increased count in the SAL entry, which I didn't see until later on.

SAL entry
Status  Date      Time      SRC       Resource  Count  PLID
NEW     01/21/19  01:10:38  57B59076  DC05      3

It was later discovered, with IBM hardware support, via SST option 6 (Display disk hardware status), that the pair was "Performance degraded", which implied the disk controllers were running with ZERO cache.

Performance degraded-
This state indicates the device is functional but performance may be impacted due to other hardware problems (such as an IOA cache problem).

We identified the "suspect" card, which was NOT operational.
We powered off the slot, then powered it back on.
The LPAR disk performance problem was resolved.

Over the years, I have had similar failures on a different LPAR with a different card pair.
Those failures did not occur during an IPL, but while the LPAR was running.
The difference in those two previous failures was that the card/slot was automatically reset by the code.
Previously, the error was an L2 cache error, and the cards needed to do a reset for data integrity reasons.
The controllers went into a recovery that lasted 23 seconds; the LPAR was "suspended" during this period.
During the recovery, several applications failed, which then needed a manual reset/recycle.

1) How does one better monitor for these types of card/pair failures?
2) Why did the 2nd and 3rd IPLs not reset the card?
3) Why did the 2nd and 3rd IPLs not "call home" and create a new PAL entry?
4) Has anyone else in the group experienced similar card/pair failures?

Thank You
_____
Paul Steinmetz
IBM i Systems Administrator