I shared some of these posts with some friends & colleagues on other platforms who are not on the list & here's what one of them had to say ... my comments at the end.

Subj: Re: Fwd: microsoft 99.9999

Al:

I don't think anyone takes Microsoft seriously when they talk about reliability. Some of the servers we use have 99.999% up time. This is accomplished by running specialized operating systems (i.e., much simpler and task specific) on machines which are totally redundant. If one hard drive/motherboard/whatever goes down, data packets are routed to the other machine, which is running in lockstep. If each machine has 99.7% up time you'll get (in theory anyway) 99.999% (or "5 nines").

The production volume on these computers is low, and applications are generally limited to functions required to operate the phone system, so they're really expensive and are only used for mission critical stuff.

I think one of the keys is that only one application runs per machine and the same guy makes all of the hardware and software, so everything is wrung out and all of the interactions are known. That, and since only trained personnel will touch the server, the operating system can dispense with a lot of user friendliness to cut down on the complexity of the code.

The only time they ever go down is if someone screws up (for example, if traffic grows beyond the point where one machine can handle it and no one is paying attention). I don't see how this kind of up time could ever be achieved by third party applications running on a closed operating system.

Anyway, my 2 cents.

________________________________________________________
W. Scott Gaines
Cincinnati Bell Wireless

Ok - Al comments - down time & for what reason can be a sore point with a lot of people. Something goes down. Users do not much care for the detail - they just want it back up again. Management wants the rate of down time to diminish.
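As a rough check on Scott's 99.7% to "5 nines" figure: if the two lockstep machines fail independently and failover is instant (both are assumptions; real pairs share power, software and operators), the service is down only when both machines are down at once. A minimal Python sketch of that arithmetic follows. The function names are made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def parallel_availability(single: float, copies: int = 2) -> float:
    """Availability of `copies` redundant machines.

    Assumes failures are independent and failover is instant, so the
    service is unavailable only while every copy is down together.
    """
    return 1.0 - (1.0 - single) ** copies


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of downtime in a year at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


single = 0.997                      # Scott's per-machine figure
pair = parallel_availability(single)

print(f"one machine : {single:.4%}, "
      f"{downtime_minutes_per_year(single):.0f} minutes down per year")
print(f"lockstep pair: {pair:.4%}, "
      f"{downtime_minutes_per_year(pair):.1f} minutes down per year")
```

Under those assumptions the pair works out to 1 - 0.003 x 0.003 = 0.999991, which is where the "in theory anyway" five nines comes from. Anything that takes out both machines together (Scott's example of traffic outgrowing one machine) is outside this model.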
Now we have hardware that is 99.98% up time on one server OS because it was designed by IBM to be that good & they keep improving on it. Then it gets loaded up with software that requires several hours of dedicated time in every 24 ... in our case we have an ERP with cost roll-ups & material planning regenerations & capacity planning & file reorg & order purges & all sorts of things whose design is such that we cannot be having other people updating the same records at the same time.

We used to run Billing in the late afternoon after the last of the day's shipments went out & it used to crash all the time because it was not designed to work at the same time as someone updating the same customer orders that are being billed. Now we delay Billing until all office staff have gone home for the evening.

I figure we have 50 users getting 10 hours per day value out of the system & I figure we average 1 crash per day with inadequately written software that inconveniences 2-3 people for 1 1/2 hours ... let's say an average of 5 people hours per day messed up due to software crashes. Fixing them keeps me busy. The causes are a mixture of user education & problems with our ERP.

Example ... someone tries to run a query on-line that happens to be accessing 2 files that are over a million records each & the complexity of the query is such that it locks up the whole system. Now he knows that it should be run on JOBQ & that if he does it there it will be finished in 2 minutes, but sometimes he forgets.

Example ... no one reads the documentation, except when they ask me to do a modification & I am studying it to figure out the best place to insert hooks & then I say "Hey ... it says ... I did not know it was supposed to work that way ... turns out no one else in the company knew that either." I just had a case of alleged data corruption that turned out to be the way the system is supposed to work & no one knew it.

Example ...
our ERP is using ungodly large fields pushing the limits of the high level language involved ... apparently if you multiply a 30 digit number with 9 decimal places by another 30 digit number with 9 decimal places & the programming language can only handle numbers that go up to 30 digits with 9 decimal places, you can have some rounding errors, and they are worst when multiplying simple numbers, like zero times one can have a result of zero plus or minus 0.00001 or various other answers, and I have found several programs where this is systemic. I asked the vendor if they would fix it & they disagreed that this was a bug. Probably my poor communication skills.

I think what I will want to have in the future is a score card figure on what the percentage of down time is, and the rate of bugs or other failures, for each of the ingredients that go into providing service to the end users:

    hardware
    OS
    software
    telcom

There is a standard for each - general industry & our purchase in particular. What performance are WE getting as opposed to other companies in our industry? What improvements are achievable at what cost?

Right now we do not have good metrics as to where there is downtime. MIS has an incident & we just fix it as quickly as we can. But we do not have any good long term statistics on how often we have incidents & how long they take to fix.

There is also ATTITUDE.

With IBM they guarantee to fix it when it breaks. They do not care whether the problem is with the hardware, the OS, something stupid we did, IBM equipment, or non-IBM equipment; they will fix the problem. Of course we pay an arm & a leg to keep IBM service on our stuff. With almost every other company there is high risk of finger pointing.
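Going back to the score card idea for a moment, here is one way it could be started, as a sketch only: log each incident with the ingredient blamed and the people hours lost, then compare the total against scheduled user hours. The `Incident` record and the `availability_by_component` function are hypothetical names, not part of any ERP or OS. The sample figures are the rough estimates from earlier in this post (50 users, 10 hours per day, about 5 people hours per day lost to software crashes).

```python
from dataclasses import dataclass

# The four ingredients named in the post.
COMPONENTS = ("hardware", "OS", "software", "telcom")


@dataclass
class Incident:
    component: str            # which ingredient was blamed
    people_hours_lost: float  # people affected x hours until fixed


def availability_by_component(incidents, users, hours_per_day, days):
    """Fraction of scheduled user hours NOT lost to each component."""
    scheduled = users * hours_per_day * days
    lost = dict.fromkeys(COMPONENTS, 0.0)
    for inc in incidents:
        lost[inc.component] = lost.get(inc.component, 0.0) + inc.people_hours_lost
    return {name: 1.0 - hours / scheduled for name, hours in lost.items()}


# One day at the estimated rate: a single software incident costing
# about 5 people hours, out of 50 users x 10 hours = 500 scheduled hours.
log = [Incident("software", 5.0)]
card = availability_by_component(log, users=50, hours_per_day=10, days=1)

for name, value in card.items():
    print(f"{name:9s} {value:.2%}")
```

With those estimates, software comes out at 99% as seen by the users, against the 99.98% quoted for the hardware. Kept up over months, a log like this would give the long term statistics that are missing today.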
I have had cases of AT&T saying the problem is in the GTE part of the phone line & GTE saying the problem is in the AT&T part of the phone line & both wanting to bill us for wasting their time & me asking accounting to look up the contract to see which of them is accountable for our ma bell service, because I do not want to be in the middle of this.

With PCs there is often the presumption, when something breaks down, that the user did something stupid to contribute to it ... like in an automobile accident one of the drivers must have been at fault, like pilot error, while for the users there is the sense that the collection of stuff is just too fragile. But I think the reality is that when a user is in a program that crashes, we just have no way of immediately knowing if it was user error, software bug, or connection gremlin.

Thank the internet for open help where we can go ask other people for suggestions.

MacWheel99@aol.com (Alister Wm Macintyre) (Al Mac)
AS/400 Data Manager & Programmer for BPCS 405 CD Rel-02 mixed mode (twinax interactive & batch) @ http://www.cen-elec.com
Central Industries of Indiana--->Quality manufacturer of wire harnesses and electrical sub-assemblies - fax # 812-424-6838

+---
| This is the Midrange System Mailing List!
| To submit a new message, send your mail to MIDRANGE-L@midrange.com.
| To subscribe to this list send email to MIDRANGE-L-SUB@midrange.com.
| To unsubscribe from this list send email to MIDRANGE-L-UNSUB@midrange.com.
| Questions should be directed to the list owner/operator: david@midrange.com
+---
This mailing list archive is Copyright 1997-2024 by midrange.com and David Gibbs as a compilation work. Use of the archive is restricted to research of a business or technical nature. Any other uses are prohibited.