RE: System Crash -- MIDRANGE-L

Could it be firmware related?

We had two issues, one possibly due to back-level firmware.

1) Back in June 2012, our P7 740 E6C threw an SRC A7004742.
We replaced the Anchor (VPD) card.
We also installed firmware AL740_088.
I found this note in the firmware; it might be related:

A problem was fixed that caused informational SRC A70047FF, which may indicate that the Anchor (VPD) card should be replaced, to be erroneously logged again after the Anchor card was replaced

Also, working with Small Products, we think we found why A7004742 did not generate a hardware call: 4742 was not present in the threshold table. If a match is not found, a hardware call is not generated, even though the LPAR called home.
GO SERVICE, option 16 (Work with threshold table).
We added 4742 manually.
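
To make the behavior above concrete, here is a rough sketch of the lookup as we understood it. This is only a model, not IBM's code: the function names, the starting table contents, and the assumption that the match key is the last four characters of the SRC are all hypothetical.

```python
# Hypothetical model of the threshold-table behavior described above.
# Not IBM's implementation; names and table contents are made up.

def reference_code(src):
    """Assumption: the lookup key is the last four characters of the SRC,
    e.g. 'A7004742' -> '4742'."""
    return src.strip().upper()[-4:]


def generates_hardware_call(src, threshold_table):
    """Return True only if the SRC's key is present in the threshold table."""
    return reference_code(src) in threshold_table


# Illustrative starting table that lacks 4742.
threshold_table = {"47FF"}

before = generates_hardware_call("A7004742", threshold_table)

# Equivalent of adding the entry manually.
threshold_table.add("4742")

after = generates_hardware_call("A7004742", threshold_table)

print(before, after)
```

In this model the SRC is silently skipped until the entry is added, which matches what we saw: the LPAR called home but no hardware call was opened.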

2) Back in March 2013, we had a B7006A21.
The backplane (motherboard) had to be replaced.

Explanation: PCI-E Switch had a permanent unrecoverable (internal) chip failure. All downstream I/O is failed.
Response: Collect error logs and info logs. Replace the PCI-E Switch FRU, and send the failed FRU with the error log data to IBM for fault analysis.
Failing Item: PCIE_SW

Paul

-----Original Message-----
From: MIDRANGE-L [mailto:midrange-l-bounces@xxxxxxxxxxxx] On Behalf Of Roberto José Etcheverry Romero
Sent: Friday, February 13, 2015 7:58 PM
To: Midrange Systems Technical Discussion
Subject: Re: System Crash

You say SCSI bus and mention MSIOP, but aren't SCSI and IOPs deprecated on P7?
It should have a dual IOP-less SAS controller, if we're talking about the CEC, of course.
I hope you get it back up quickly. I'm waiting on a client who left the "apply PTF" stage for exactly when I had to add some cards to the enclosure (thereby giving me enough spare time to post this reply...). I also found out that an IBM CE had replaced batteries just a few days ago. The weird thing?
C19 was reporting as unknown and a PCI bus failure was being logged.
I re-seated the card and everything went back to normal. Now, C19 is one of the two controllers that manage the internal disks on the 720, so the system had been running on one controller for the last 38 days. So an IBM CE left a card unseated AND didn't use VIOS' diag tool to check that everything was OK after replacing the batteries.
Maybe it's something like that? You were running on one controller and it threw a fit?

Best of luck and hope it's quickly fixed.

Roberto

On Fri, Feb 13, 2015 at 8:38 PM, Graap, Kenneth <Kenneth.Graap@xxxxxxxxxxxxx> wrote:

Our Power7 server CRASHED again. The second time in the last 3 months!

The first time was LIC related, this time it was a hardware failure.

When one of the mirrored load source drives failed, it threw some "noise"
out on the SCSI bus, which caused the MSIOP to fail, and the system crashed.

Do you think this has anything to do with it being Friday the 13th?!?

Has anyone ever experienced anything like this? You'd think that
mirrored disk protection for the load source disk would be sufficient
to protect against a system crash. Apparently not!

Kenneth
Kenneth E. Graap
NW Natural
System Administrator for IBM Power Systems
503.226.4211 x5537
http://www.linkedin.com/in/kennethgraap

--
This is the Midrange Systems Technical Discussion (MIDRANGE-L) mailing list.
To post a message, email: MIDRANGE-L@xxxxxxxxxxxx
To subscribe, unsubscribe, or change list options,
visit: http://lists.midrange.com/mailman/listinfo/midrange-l
or email: MIDRANGE-L-request@xxxxxxxxxxxx
Before posting, please take a moment to review the archives
at http://archive.midrange.com/midrange-l.
