Re: Fwd: microsoft 99.9999 -- MIDRANGE-L

I shared some of these posts with some friends & colleagues on other platforms 
who are not on the list & here's what one of them had to say ... my comments 
at the end.

Subj:    Re: Fwd: microsoft 99.9999

Al:

I don't think anyone takes Microsoft seriously when they talk about
reliability.

Some of the servers we use have 99.999% up time.  This is accomplished by
running specialized operating systems (i.e. - much simpler and task
specific) on machines which are totally redundant.  If one hard
drive/mother board/whatever goes down, data packets are routed to the other
machine which is running in lockstep.  If each machine has 99.7% up time
you'll get (in theory anyway) 99.999% (or "5 nines").  The production
volume on these computers is low, and applications are generally limited to
functions required to operate the phone system, so they're really expensive
and are only used for mission critical stuff.  I think one of the keys is
that only one application runs per machine and the same guy makes all of
the hardware and software, so everything is wrung out and all of the
interactions are known.  That, and since only trained personnel will touch
the server, the operating system can dispense with a lot of user
friendliness to cut down on the complexity of the code.  The only time they
ever go down is if someone screws up (for example, if traffic grows beyond
the point where one machine can handle it and no one is paying attention).
I don't see how this kind of up time could ever be achieved by third party
applications running on a closed operating system.
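
A quick check of the "5 nines" arithmetic above, as a minimal sketch only. It assumes the two machines fail independently of each other (the post does not say so) and that service stays up as long as at least one machine is up. The function names are made up for illustration.

```python
def combined_availability(single: float, machines: int = 2) -> float:
    """Availability of N redundant machines.

    Assumption: failures are independent, and the service is up
    whenever at least one machine is up.
    """
    return 1.0 - (1.0 - single) ** machines


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime in minutes over a 365-day year."""
    return (1.0 - availability) * 365 * 24 * 60


if __name__ == "__main__":
    single = 0.997
    pair = combined_availability(single)
    print(f"one machine  : {single:.4%} up, "
          f"{downtime_minutes_per_year(single):.0f} min/year down")
    print(f"lockstep pair: {pair:.4%} up, "
          f"{downtime_minutes_per_year(pair):.1f} min/year down")
```

Under those assumptions a pair of 99.7% machines works out to roughly 99.9991%, which is consistent with the figure quoted.  Real pairs share power, network and operators, so the independence assumption is optimistic.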

Anyway, my 2 cents.

________________________________________________________
W. Scott Gaines
Cincinnati Bell Wireless

Ok - Al's comments - down time, and the reason for it, can be a sore point 
with a lot of people.  Something goes down.  Users do not much care for the 
detail - they just want it back up again.  Management wants the rate of down 
time to diminish.

Now we have hardware with 99.98% up time on one server OS because it was 
designed by IBM to be that good & they keep improving on it.

Then it gets loaded up with software that requires several hours of dedicated 
time in every 24 ... in our case we have an ERP with cost roll-ups & material 
planning regenerations & capacity planning & file reorg & order purges & all 
sorts of things whose design is such that we cannot be having other people 
updating the same records at the same time.  We used to run Billing in the 
late afternoon after the last of the day's shipments went out & it used to 
crash all the time because it was not designed to work at the same time as 
someone updating the same customer orders that are being billed.  Now we delay 
Billing until all office staff have gone home for the evening.

I figure we have 50 users getting 10 hours per day value out of the system & 
I figure we average 1 crash per day with inadequately written software that 
inconveniences 2-3 people for 1 1/2 hour ... let's say average of 5 people 
hours per day messed up due to software crashes.  Fixing them keeps me busy.  
The causes are a mixture of user education & problems with our ERP.
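
The estimate above can be turned into a percentage, as a rough sketch.  The inputs are the figures from the paragraph (50 users, 10 hours each, about 5 people-hours lost per day); the function name is hypothetical.

```python
def lost_fraction(users: int, hours_per_user: float,
                  lost_person_hours: float) -> float:
    """Share of the daily user-hours lost to crashes."""
    return lost_person_hours / (users * hours_per_user)


if __name__ == "__main__":
    # One crash hitting 2-3 people for 1.5 hours each.
    low, high = 2 * 1.5, 3 * 1.5
    print(f"one crash costs {low} to {high} people-hours")

    # The post rounds that up to 5 people-hours per day.
    lost = lost_fraction(users=50, hours_per_user=10, lost_person_hours=5)
    print(f"lost to software crashes: {lost:.1%} of user time")
    print(f"effective up time as users see it: {1 - lost:.1%}")
```

On those numbers the software costs about 1% of user time, so the users see something like 99% even though the hardware figure is 99.98%.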

Example ... someone tries to run a query on-line that happens to be accessing 
2 files that are over a million records each & the complexity of the query is 
such that it locks up the whole system.  Now he knows that it should be run 
on JOBQ & that if he does it there it will be finished in 2 minutes, but 
sometimes he forgets.

Example ... no one reads the documentation, except when they ask me to do a 
modification & I am studying it to figure out the best place to insert hooks 
& then I say "Hey ... it says ... I did not know it was supposed to work that 
way" ... turns out no one else in the company knew that either.  I just had a 
case of alleged data corruption that turned out to be the way the system is 
supposed to work & no one knew it.

Example ... our ERP is using ungodly large fields pushing the limits of the 
high level language involved ... apparently if you multiply a 30 digit number 
with 9 decimal places by another 30 digit number with 9 decimal places & the 
programming language can only handle numbers that go up to 30 digits with 9 
decimal places, you can have some rounding errors, and they are worst when 
multiplying simple numbers, like zero times one can have a result of 
    zero
    plus or minus 0.00001
    various other answers
and I have found several programs where this is systemic, and I asked the 
vendor if they would fix it & they disagreed that this was a bug ... 
probably my poor communication skills.
I think what I will want to have in the future is a score card figure on what 
the percentage of down time is, and the rate of bugs or other failures, for 
each of the ingredients that go into providing service to the end users.

hardware
OS
software
telecom

There is a standard for each - general industry & our purchase in particular.
What performance are WE getting as opposed to other companies in our industry?
What improvements are achievable at what cost?

Right now we do not have good metrics as to where there is downtime.
MIS has an incident & we just fix it as quickly as we can.
But we do not have any good long term statistics on how often we have 
incidents & how long they take to fix.
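
A score card like the one described could start from a simple incident log.  The sketch below is one possible shape, not something we run today: the Incident record, the scorecard function and the sample entries are all invented for illustration, and the component names are the four listed above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Incident:
    component: str     # "hardware", "OS", "software" or "telecom"
    hours_down: float  # time from failure until it was fixed


def scorecard(incidents, scheduled_hours: float) -> dict:
    """Summarise incidents per component over a reporting period.

    scheduled_hours is how long the system was supposed to be available
    in that period (an input you choose, e.g. working hours in a month).
    """
    totals = defaultdict(lambda: [0, 0.0])  # component -> [count, hours]
    for inc in incidents:
        totals[inc.component][0] += 1
        totals[inc.component][1] += inc.hours_down

    card = {}
    for component, (count, hours) in totals.items():
        card[component] = {
            "incidents": count,
            "hours_down": hours,
            "mean_hours_to_fix": hours / count,
            "percent_down": 100.0 * hours / scheduled_hours,
        }
    return card


if __name__ == "__main__":
    # Made-up sample log, not real data.
    log = [
        Incident("software", 1.5),
        Incident("software", 0.5),
        Incident("telecom", 2.0),
    ]
    for component, row in scorecard(log, scheduled_hours=200.0).items():
        print(component, row)
```

Even a log this crude, kept for a few months, would give the long term statistics on how often incidents happen and how long they take to fix.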

There is also ATTITUDE 
With IBM they guarantee to fix it when it breaks.  They do not care whether 
the problem is with the hardware, the OS, something stupid we did, IBM 
equipment, or non-IBM equipment; they will fix the problem.  Of course we pay 
an arm & a 
leg to keep IBM service on our stuff.
With almost every other company there is high risk of finger pointing.  I 
have had cases of AT&T saying the problem is in GTE part of the phone line & 
GTE saying the problem is in AT&T part of the phone line & both wanting to 
bill us for wasting our time & me asking accounting to look up the contract 
to see which of them is accountable for our ma bell service because I do not 
want to be in the middle of this.

With PCs there is often the appearance that when something breaks down ... 
there is the presumption that the user did something stupid to contribute to 
it ... like in an automobile accident one of the drivers must have been at 
fault, like pilot error, while for the users there is the sense that the 
collection of stuff is just too fragile.
But I think the reality is that when a user is in a program that crashes, we 
just have no way of immediately knowing if it was user error, software bug, 
or connection gremlin.

Thank the internet for open help where we can go ask other people for 
suggestions.

MacWheel99@aol.com (Alister Wm Macintyre) (Al Mac)
AS/400 Data Manager & Programmer for BPCS 405 CD Rel-02 mixed mode (twinax 
interactive & batch) @ http://www.cen-elec.com Central Industries of 
Indiana--->Quality manufacturer of wire harnesses and electrical 
sub-assemblies - fax # 812-424-6838
