MIDRANGE dot COM Mailing List Archive



Home » MIDRANGE-L » November 2006

Well... it finally happened.





I've been working with the AS/400/iSeries/i5 for around 12 years now and
administering (at least) one of them at various jobs for about 8 years.
In all that time I had seen only one true crash, where the machine went
down hard with no advance warning.  Until now...

Monday night, while Domino was going down for an automated backup, we lost
a 90MB RAID controller card that controls our 2nd set of 4 RAIDed drives.
We switched our Domino users over to the clustered server and our WAS apps
over to the other box as well (a 15-20 minute task).  We found the box down
the next morning and spent the morning diagnosing the actual failure with
the DASD group in Rochester.  The afternoon was spent trying to reseat the
card, and then removing it and reinstalling it after letting it sit outside
of the machine for a while, but this particular card never would come up as
visible to the server again.
The CE showed up with the couriered replacement at around 6pm and we spent
the next 5 hours on the phone with Rochester trying several other tricks to
get the lost cache back - no dice (48800 'blocks' was the statistic for
what was lost in the cache).  We got the word at around 10:30pm that a
reload/restore would be necessary.

Yesterday was spent doing the restore and testing out the basics of WAS and
Domino.  I plugged the ethernet cable back into the network at around
6:45pm last night and Domino was replicated back to normal (from last
Friday's backup through Wednesday evening's current activity) by 7:30pm.
WAS and Domino both worked like a charm after restoring an Option 21 save
from last month and then a full IFS save from last Friday night on top of
the Option 21 restore.


This is one heck of a box, and for the most part, IBM does one heck of a
job supporting it.  Also key in such an 'easy' (although fretful) recovery
was the fact that once a month we dutifully switch these apps to the backup
box from the afternoon until the next morning.  Each month the redundancy
is tested, and it keeps us in good practice for what's required to get
things moved over to the backup box (and then moved back to the main) - in
short, we KNOW we can trust our backup server and we are comfortable doing
the switch.  Even if you know your redundancy works... if at all possible
it's very worthwhile to keep testing and practicing it!






This mailing list archive is Copyright 1997-2014 by MIDRANGE dot COM and David Gibbs as a compilation work. Use of the archive is restricted to research of a business or technical nature. Any other uses are prohibited.