Re: 400 Disk Storage Crisis -- MIDRANGE-L

Thanks for the list of things for me to look into further.

As it turned out, the local tornado seems to have disrupted Internet traffic, so the discussion list did not get to see my post about the problem until long after I had fixed the immediate problem.


lessons learned
============
The tornado struck Evansville a few miles from where we are located.

I store off-site backups in 2 different places. I am now thinking they should be in waterproof baggies.

It was the disk SPACE that got wiped out, NOT the disk CONTENTS. I did not know that initially.

It is my understanding that an IPL includes the recovery of the deleted spool file disk space. We did another IPL in the wee hours of Monday, after disk space was back to normal. However, I think there is a System Value (I have to check which one) related to how many days worth of recovery ... it might be smart, if there is some evening during the week when the nite clerk gets done early, to use GO POWER to schedule another wee hours IPL.

I am now certain the runaway job did it, although in my rush to kill kill kill anything that I could, I also lost details on what precisely went wrong in that program.

I reran that job and it ran fine, so I now suspect that some of the other stuff I was running Saturday needs to be on the list of what not to run at the same time. Also I need to study how to put safeties on jobs to make runaways less likely.

I have security auditing going, and only infrequently remember to check what info got there ... I think I ought to put the command to do that on one of my menus, as a reminder to do it more often.


Another disk space management issue
============================
We have files that "grow" to grab more disk space as needed.

This means that when I kill a lot of ancient records, the files are still "reserving" space for growth that may be excessive, so I need to downsize some without losing the growth support ... that is for reasons of using disk space wisely. There is also a performance issue associated with spotting files nearing their next growth step, and perhaps upsizing them before that happens in the middle of the work day. I expect the answer is a query over an *OUTFILE, looking for files at the extremes of not having much growth left, or having excessive growth space.
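
The kind of check I have in mind could be sketched roughly like this, in Python rather than as an actual query, just to show the logic. Everything here is an assumption for illustration: the field names (name, active_records, allocated_records) are made up and are not the real *OUTFILE field names, and the 10% and 50% cutoffs are arbitrary starting points I would expect to tune.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """One row of file statistics (hypothetical field names, not real *OUTFILE fields)."""
    name: str
    active_records: int      # records currently in the file
    allocated_records: int   # records the file has space reserved for right now


def classify(stats, low_headroom=0.10, high_headroom=0.50):
    """Flag a file whose spare space is at either extreme.

    The thresholds are assumed values: at or below 10% spare space means the
    next growth step is close, at or above 50% spare means the file may be
    holding more space than it needs.
    """
    if stats.allocated_records <= 0:
        return "unknown"
    spare = stats.allocated_records - stats.active_records
    headroom = spare / stats.allocated_records
    if headroom <= low_headroom:
        return "near_growth"
    if headroom >= high_headroom:
        return "excess_space"
    return "ok"


def files_needing_attention(rows):
    """Return {file name: flag} for every file that is not 'ok'."""
    flagged = {}
    for row in rows:
        flag = classify(row)
        if flag != "ok":
            flagged[row.name] = flag
    return flagged
```

Run against the exported rows, the "near_growth" files are candidates for upsizing after hours, and the "excess_space" files are candidates for downsizing after a purge.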


Backups ... why so infrequent
======

Historically we have had to recover stuff from backups perhaps once every 3-4 months, since we moved to BPCS. In most cases, the recoveries do not impact most end users, so they are not a witness to the rate of recovery ... what would typically happen is one person deletes a query definition that they think no one is using, then a month later someone else tries to run it and it bombs.

When we were on MAPICS we had to go back to the last backup, on average, a couple times a week. When bad storms came thru the area, we had to do so several times a day. I would ask management to have everyone off the system until the storm passed, but they never thought it necessary, because it was clerical people further down the food chain that had to rekey all their work several times.

Until the move to other offices, we used to run backups/400 almost every week nite. A lot of people leave their work stations signed on in the middle of some update ... different people different nites ... if I kill their sessions, then that crashes what they are updating, and there's other stuff to have to fix, so I figured a backup every 2nd or 3rd nite was probably tolerable.

I opted not to move my residence to the new AS/400 site city, and asked about people at that site who could perhaps run the kind of backup I had been doing.

Unfortunately we have people who need to be on the system until the wee hours. When I was doing backups where the AS/400 is located, I could wait until people needing to do updates were all off the system (last one at 2 am) then force restricted state, and do a full backup. But in the current reality, I can only kick people off and start a backup, not in a restricted state; then people sign on again, and various critical files that their inquiries are accessing do not get into the backup.

The morning crew comes on at 5 am, so if I am going to do a nitely backup, I have to start it before 3 am, and have some way to enforce nite crew inquiry staying off the system during that time frame. I do not have the political clout in the company food chain to get that. It is partly a matter of user education ... the users are accustomed to signing on and accessing the system whenever they please. This is one of the reasons why I have to visit the site when doing end fiscal. I have to get a complete backup (I like to get two of them, one before and one after the fiscal update jobs), and I have to get restricted state when running end fiscal updates.

One of the people at the AS/400 site changes the backup tape media for me,and runs the cleaning tape as needed.


-
Al Macintyre  http://www.ryze.com/go/Al9Mac
BPCS/400 Computer Janitor ... see
http://radio.weblogs.com/0107846/stories/2002/11/08/bpcsDocSources.html