RE: IPL frequency (was Who has a big QTEMP?) -- MIDRANGE-L

re: graceful solutions to out of space/memory condition

I agree that it would be better if something more graceful than a system
hang or forced IPL was implemented.  Obviously I have no idea of the
feasibility of that, since I'm not an OS/400 developer.

re: runaway jobs

The question here is how should the system assign blame?  Start ending each
job that asks for a new disk extent until there's space again?  That would
get the culprit, but would also likely knock over a bunch of "innocent" jobs
as well.  That might be preferable to knocking over all the jobs on the
system (effectively what the forced IPL does), but it isn't going to make
anyone who happens to get nicked happy.

As for your example, I like it.  Of course, I'm a proponent of journalling
and commitment control, unlike so many AS/400 programmers who "know" that
these things add too much overhead and complexity to be used.  Sigh.

Anyway, my point wasn't so much to argue as to point out differences that
would partly explain why performance drops off as DASD fills, etc.  I'm sure
there are ways to improve what the system does when it runs out of space.
I'm also sure that it isn't likely to get a real high priority in the
development lab unless someone really important to IBM makes an issue of it
(maybe you're that customer, I know I'm not <grin>).

Dave Shaw
Spartan International, Inc.
Spartanburg, SC

-----Original Message-----
From: James David Rich [mailto:james@dansfoods.com]

On Wed, 30 Aug 2000, Shaw, David wrote:

> >Yes, my other machines (AIX and linux) do run well with very nearly full
> >disks.
> 
> Don't those OS's have their virtual memory swap spaces pre-allocated on
> disk?  That makes a machine much more tolerant of having nearly full disks,
> since some of the "fullness" is actually empty space waiting to be used.
> Undersize your swap space, though, and you have the same kinds of problems
> when it fills up.  The /400's virtual memory model is a lot more flexible,
> but does need space to work in, and doesn't provide a way to "hide" it in a
> pre-allocated space.

So then does the need to IPL come from an out of memory situation?  Even
an OS that uses pre-allocated swap space can use up all available
memory.  A graceful (non-rebooting) solution should exist.  Not that a
graceful solution necessarily exists on the above platforms (my linux box
has died once in an OOM situation, not sure about AIX or newer linux
kernels).

> >the reasons people shut them down.  It surprises me that an IPL was a
> >requirement to simply get some disk space back.
> 
> It actually isn't a requirement.  I think the reason that the machine does
> it on its own when it maxes out is because the designers found it to be the
> simplest way to:
> 
> 1) stop the (perhaps unknown) process(es) filling the space, and
> 2) get back space from temporary objects, QTEMP libraries, and virtual
> memory so that the machine can run normally again.
> 
> I caught a system once at 99.98%, and managed to stop the job filling it up
> and recover the disk space without an IPL (job was doing saves to save files
> in QTEMP - without compression).  It was at the point where no one could
> sign on - attempts to do so would hang.  Fortunately I knew which job it had
> to be and was able to kill it at the console, which we left signed on to
> QSYSOPR at all times.

This is a perfect example of where I think a solution is needed.  Why
should one runaway job be able to bring down the system?  Why doesn't the
OS issue a 'no space left on device' error to the offending job?  Why does
the OS give up on swap space that it had already allocated and still
needs?  It seems that if the system was able to procure x amount of swap
it shouldn't give up that swap until it is no longer needed.

As for applications writing to disk, shouldn't there be logic in the
program that does something like:

[write something]
[check for error (like disk full)]
[write something else]
[check for error]
[if no error]
        [commit]
[endif]

For the example you gave above shouldn't the OS issue an error to the save
job at which point the save job would die (hopefully cleaning up after
itself), leaving the system usable?
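
To make the write/check/commit idea above concrete, here is a minimal
sketch in Python.  It is only an illustration under stated assumptions,
not how OS/400 or any save command actually behaves: the function name
save_with_commit and the write hook are hypothetical, "commit" is modelled
as renaming a temporary file into place, and "disk full" is taken to be an
OSError carrying errno.ENOSPC, which is how Python reports a 'no space left
on device' failure.

```python
import errno
import os
import tempfile


def save_with_commit(records, final_path, write=None):
    """Write records to a temporary file and commit only if every write worked.

    `write` is a hypothetical hook, called as write(out, record), so that a
    disk-full error can be simulated; by default it simply calls out.write().
    Returns True if the output was committed, False if the disk filled up and
    the partial output was removed. Other I/O errors are re-raised.
    """
    if write is None:
        write = lambda out, record: out.write(record)

    directory = os.path.dirname(os.path.abspath(final_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as out:
            for record in records:
                # [write something] / [check for error]: a failed write
                # surfaces as an OSError instead of a status code.
                write(out, record)
            # Push buffered data to the device so a full disk shows up here
            # rather than after the commit.
            out.flush()
            os.fsync(out.fileno())
        # [if no error] [commit]: make the finished file visible in one step.
        os.replace(tmp_path, final_path)
        return True
    except OSError as err:
        # Clean up after ourselves so the failed job gives its space back.
        os.unlink(tmp_path)
        if err.errno == errno.ENOSPC:
            return False
        raise
```

The point of the sketch is that the job that hits the error is the one
that stops and releases its space, while everything else keeps running.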

James Rich
james@dansfoods.com

+---
| This is the Midrange System Mailing List!
| To submit a new message, send your mail to MIDRANGE-L@midrange.com.
| To subscribe to this list send email to MIDRANGE-L-SUB@midrange.com.
| To unsubscribe from this list send email to MIDRANGE-L-UNSUB@midrange.com.
| Questions should be directed to the list owner/operator: david@midrange.com
+---