Re: -- MIDRANGE-L

On Wed, Mar 19, 2008 at 3:11 PM, Burns, Bryan <Bryan_Burns@xxxxxxxxxxxx> wrote:

Looking for input on the best approach for the alternative phone number to use in Service Contact Information. For years, I've been the primary and a coworker has been the alternative, but I've been considering changing the alternative to my cell number so IBM will contact me at home for hardware failures. However, if I'm not reachable - let's say I'm out of town - then our organization will not be aware of the issue.

If the first you hear of a hardware fault or any other major
malfunction is a call from IBM, you're doing something very, very
wrong.

If your infrastructure is important, it needs to be monitored. In all
kinds of ways. Hardware is just a very basic part of that. There is
something much more important than the hardware, and that is the
software.

While detecting a failed disk drive is something that's easy to test
for, there are many other factors you'll need to monitor in order to
make sure that your system stays up and running.

We monitor almost every part of our environment - from physical to
logical. Does the AC run? What's the temperature in the server room?
When was the last UPS self test? What's the voltage baseline on our
incoming UPS power? Does our software work? How many orders are added
per hour? What's our servers' CPU load? What's our servers' memory
load? Is the data transfer to our partners still up and running? Does
communication with all branch offices work? etc. etc.
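
To make that concrete, here is a minimal Python sketch of running a
handful of such checks in one pass. Everything in it is an assumption
for illustration: run_checks, the check names, and the placeholder
readings are made up, and real checks would query sensors, databases
or remote hosts instead of comparing constants.

```python
from typing import Callable, Dict, List


def run_checks(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check and return the names of those that failed."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            # A check that cannot run counts as a failure, not as "fine".
            ok = False
        if not ok:
            failed.append(name)
    return failed


def broken_probe() -> bool:
    # Stands in for a probe whose target cannot be reached.
    raise RuntimeError("probe unreachable")


# Hypothetical checks with placeholder readings instead of real sensors.
example_checks = {
    "server room below 27 C": lambda: 22.5 < 27.0,
    "orders added in the last hour": lambda: 0 > 0,
    "partner transfer reachable": broken_probe,
}

failed_checks = run_checks(example_checks)
print(failed_checks)
```

The point of the try/except is that a probe which can't even run gets
reported like any other failure, so a dead monitoring path doesn't
look like a healthy system.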

It's important that you also establish and monitor baselines for a
variety of performance metrics and alert your staff if something is
off. For example, if during work hours the load on your System i is
usually 40-60%, there is probably something off if the load is 10% or
95%.
Maybe it's just a day with few orders, or one with a lot of orders -
but someone actually needs to take a look and see if the current
behaviour is intended or not.
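
A baseline check like that can be sketched in a few lines of Python.
This is only an illustration under my own assumptions: check_baseline
is a made-up helper, and the 40-60 band is the example figure from
above, not a recommendation. How the current value is collected and
how the alert reaches your staff is left out.

```python
from typing import Optional


def check_baseline(metric: str, value: float,
                   low: float, high: float) -> Optional[str]:
    """Return an alert message if value is outside the usual band."""
    if value < low:
        return f"{metric} is {value}, below the usual {low}-{high} range"
    if value > high:
        return f"{metric} is {value}, above the usual {low}-{high} range"
    # Inside the band: nothing to report.
    return None


# Example readings against the 40-60% work-hours band.
for load in (10, 50, 95):
    alert = check_baseline("CPU load %", load, 40, 60)
    if alert is not None:
        print(alert)
```

Note that both directions raise an alert. An unusually quiet system is
just as worth a look as an overloaded one.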

There's a lot of software out there for complete systems management
and monitoring. Unfortunately, there is no product that's IMHO "good
at everything". There's IBM Director, which can be used for a lot of
hardware monitoring and other stuff, and it works very well with IBM
hardware. Then there's Microsoft System Center Operations Manager
(SCOM), which is very good at monitoring Windows servers on almost any
hardware, but not-so-good at monitoring network equipment and piss
poor at monitoring non-Windows software (lots of plugins for 3rd party
Windows software, though). I deve^W hacked together an i5/OS SCOM
management pack, but it leaves a lot to be desired.

HP also has an offering, and I believe Dell does too. There are also
some very agnostic open source packages like Nagios. I used Nagios
before switching to SCOM - Nagios is pretty Unix-focused, and there
are some i5/OS related management packs out in the wild. There are
also Windows Management packs for Nagios. As such, Nagios might be a
very good idea if you do not have any budget or are a Unix-centric
shop.