Larry,

In all 3 cases, the card(s) continued to function without issue after a reset; the cards were never replaced.
So, is the card bad?
Or, does the code need to be improved?

Paul

-----Original Message-----
From: MIDRANGE-L [mailto:midrange-l-bounces@xxxxxxxxxxxxxxxxxx] On Behalf Of DrFranken
Sent: Monday, January 21, 2019 4:05 PM
To: Midrange Systems Technical Discussion
Subject: Re: 57B5 (5913) card/pair failure/recovery - difficult to monitor and troubleshoot

Not necessarily all answers in order. :-)

These cards are getting quite old and we are seeing failures. If you
recall these were the very FIRST cards with no batteries.

"Degraded" used to indicate battery failure, primarily because that
'turned off' the cache since the cache was no longer backed up.
However, the loss of one card in any RAID card pair also turns
off the cache, and for the same reason. So "Degraded" today mostly
indicates either a card failure or a loose cable.

The additional IPLs might not have reset the card if it was a hard fail
that stuck the card in a loop, for example. I would almost expect a
PWRDWNSYS RESTART(*YES) IPL to NOT fix this. A full power down and
restart from the HMC, or certainly pushing the white button on a
single partition server, would be more likely to break the loop and wake
up the card. Potentially even powering the slot down in service tools
and back up would fix this. I wouldn't be inclined to do this while the
system was in production though.

Creating a new entry wouldn't be needed since the problem still exists
in SAL and is not acknowledged, closed or cleared. It IS the same
problem so just count the occurrences.

I would think the best defense is to check WRKPRB as there should be an
open problem there.

- Larry "DrFranken" Bolhuis

www.Frankeni.com
www.iDevCloud.com - Personal Development IBM i timeshare service.
www.iInTheCloud.com - Commercial IBM i Cloud Hosting.

On 1/21/2019 3:42 PM, Steinmetz, Paul wrote:
Early this morning, I had a 57B5 (5913) pair fail during an IPL.
Because our monitoring software is not running during the IPL, I initially did not see the alert.
After two additional IPLs, the LPAR continued to run, but performance was EXTREMELY poor.
The initial "call home" PAL entry did not reappear; only the count in the SAL entry increased, which I didn't see until later on.

SAL entry
Status  Date      Time      SRC       Resource  Count  PLID
NEW     01/21/19  01:10:38  57B59076  DC05      3
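
Since the only visible symptom here was a growing Count on an existing SAL entry, one option is to track that count between checks. The following is only a rough Python sketch under stated assumptions: it assumes the SAL row has already been captured as plain text in the column order shown above (how that text is obtained from the system is not covered), and the names SalEntry, parse_sal_row and needs_attention are hypothetical helpers, not part of any IBM tool.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class SalEntry:
    """One row of the SAL listing (hypothetical structure for illustration)."""
    status: str
    date: str
    time: str
    src: str
    resource: str
    count: int


def parse_sal_row(row: str) -> SalEntry:
    """Split a whitespace-separated SAL row into its first six columns.

    Assumes the column order Status, Date, Time, SRC, Resource, Count.
    A trailing PLID column, if present, is ignored.
    """
    fields = row.split()
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 columns, got {len(fields)}: {row!r}")
    status, date, time, src, resource, count = fields[:6]
    return SalEntry(status, date, time, src, resource, int(count))


def needs_attention(seen: Dict[Tuple[str, str], int], entry: SalEntry) -> bool:
    """Return True if this SRC/resource pair is new or its count has grown.

    `seen` maps (SRC, resource) to the last count observed and is updated
    in place, so repeated checks with an unchanged count return False.
    """
    key = (entry.src, entry.resource)
    last = seen.get(key)
    seen[key] = entry.count
    return last is None or entry.count > last


if __name__ == "__main__":
    seen: Dict[Tuple[str, str], int] = {}
    entry = parse_sal_row("NEW 01/21/19 01:10:38 57B59076 DC05 3")
    if needs_attention(seen, entry):
        print(f"Check {entry.resource}: SRC {entry.src}, count {entry.count}")
```

Keying on SRC plus resource rather than on the timestamp is a deliberate choice in this sketch: in the case above the entry kept its original date and time while only the count changed.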

It was later discovered, with IBM hardware support, via SST ("6. Display disk hardware status"),
that the pair was "Performance degraded", which implied the disk controllers were running with ZERO cache.

Performance degraded-
This state indicates the device is functional but performance may be impacted due to other hardware problems (such as an IOA cache problem).

We identified the "suspect" card, which was NOT operational.
We powered off the slot, then powered it back on.
The LPAR disk performance problem was resolved.

I had similar failures on a different LPAR, with a different card pair, over the years.
Those failures were not during an IPL, but while the LPAR was running.
The difference in those two previous failures was that the card/slot was automatically reset by the code.

Previously, the error was an L2 cache error and the cards needed to do a reset for data integrity reasons.
The controllers went into a recovery that lasted 23 seconds; the LPAR was "suspended" during this period.
During the recovery, several applications failed, which then needed a manual reset/recycle.

1) How does one better monitor for these types of card/pair failures?
2) Why did the 2nd and 3rd IPLs not reset the card?
3) Why did the 2nd and 3rd IPLs not "call home" and create a new PAL entry?
4) Has anyone else in the group experienced similar card/pair failures?

Thank You
_____
Paul Steinmetz
IBM i Systems Administrator

Pencor Services, Inc.
462 Delaware Ave
Palmerton Pa 18071

610-826-9117 work
610-826-9188 fax
610-349-0913 cell
610-377-6012 home

psteinmetz@xxxxxxxxxx
http://www.pencor.com/


This mailing list archive is Copyright 1997-2024 by midrange.com and David Gibbs as a compilation work. Use of the archive is restricted to research of a business or technical nature. Any other uses are prohibited. Full details are available on our policy page. If you have questions about this, please contact [javascript protected email address].