RE: System running out of ports -- MIDRANGE-L

I see your point now. If the problem still exists after getting current
on PTFs, IPLing the system, and checking your TCP attributes, then a
reclaim storage can find and fix damaged objects in the OS, which may
correct the long linger. I also had that issue, I think back when I was
on V5R1. Current cume and PTF groups fixed my issues.

Once again you have opened my eyes.

Chris Bipes
Director of Information Services
CrossCheck, Inc.

-----Original Message-----
From: midrange-l-bounces@xxxxxxxxxxxx
[mailto:midrange-l-bounces@xxxxxxxxxxxx] On Behalf Of Scott Klement
Sent: Wednesday, September 24, 2008 4:42 PM
To: Midrange Systems Technical Discussion
Subject: Re: System running out of ports

Please... before blowing off my suggestion, please consider my logic.

First, let me explain what TIME-WAIT means... TCP is considered a
"reliable" protocol. That means that every time something is sent, an
"ACK" response is sent to notify the remote side that the data was
received. If no ACK arrives, the data is resent. This guarantees that
data is either received, or you get an error (or time out). You'll
always know whether data is received or not, and that's the goal of a
reliable protocol.
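
To make the ACK/resend idea concrete, here is a rough sketch in Python. It
is only an illustration of "resend until acknowledged, or give up with an
error", not how a real TCP stack is written. The names send_reliably and
make_lossy_channel, and the max_tries limit, are made up for this example.

```python
def send_reliably(payload, channel, max_tries=5):
    """Resend payload until the channel returns an ACK.

    Returns the number of transmissions it took. Raises TimeoutError
    if no ACK arrives within max_tries attempts, so the caller always
    learns whether the data got through.
    """
    for attempt in range(1, max_tries + 1):
        if channel(payload) == "ACK":
            return attempt
    raise TimeoutError(f"no ACK after {max_tries} tries")


def make_lossy_channel(drop_first):
    """Hypothetical channel that loses the first drop_first transmissions."""
    state = {"sent": 0}

    def channel(payload):
        state["sent"] += 1
        # A lost transmission produces no ACK at all.
        return None if state["sent"] <= drop_first else "ACK"

    return channel


# Two transmissions are lost, the third one is acknowledged.
attempts = send_reliably(b"hello", make_lossy_channel(drop_first=2))
print("delivered after", attempts, "transmissions")
```

Either the call returns (data received) or it raises (error/time out).
There is no third outcome, which is the whole point of a reliable protocol.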

The problem comes at the end of a session. When the socket is closed,
you send a FIN (for "finished") packet to the remote side. Likewise, it
sends you a FIN packet. That's how you know the session is closed. You
respond with an ACK so the remote side knows you received the FIN.

What happens if that final ACK gets lost in the network? You never
send an ACK in response to an ACK (that would create an endless loop!).
So you never know if that ACK was received. If the network is flaky
and drops your ACK packet, the remote side will eventually re-send the
FIN. So you need to be ready to acknowledge the repeated FIN, otherwise
the remote side could get stuck in a never-ending loop of sending FIN...
etc.

TIME-WAIT state is the state where the socket is sitting and waiting
"just in case" that ACK got lost. It waits a long enough time (2 mins
by default) that if there's going to be a re-send of the FIN, it will
have already received it...

So that's what TIME-WAIT is... it's waiting "just in case" that FIN is
re-sent, and it has to send a new ACK.
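
A toy model of that waiting period, again in Python and again only a
sketch: the class name ClosingSocket and its methods are invented for
illustration, time is passed in as plain seconds, and a real TCP stack
does more than this. The 120 second figure is the default mentioned in
this message.

```python
TIME_WAIT_SECONDS = 120  # default time-wait timeout quoted in this message


class ClosingSocket:
    """Toy model of the side that sent the final ACK of a session."""

    def __init__(self, now=0):
        self.state = "TIME_WAIT"
        self.expires = now + TIME_WAIT_SECONDS
        self.acks_sent = 1  # the final ACK, which may have been lost

    def tick(self, now):
        """Leave TIME_WAIT once the timeout has passed."""
        if self.state == "TIME_WAIT" and now >= self.expires:
            self.state = "CLOSED"

    def on_fin(self, now):
        """A repeated FIN arrived: answer it if we are still waiting."""
        self.tick(now)
        if self.state == "TIME_WAIT":
            self.acks_sent += 1
            return "ACK"
        return None  # this model no longer knows about the connection


sock = ClosingSocket(now=0)
print(sock.on_fin(now=30))   # remote side re-sent its FIN, so we ACK again
sock.tick(now=120)
print(sock.state)            # timeout passed, the port is free again
```

If no repeated FIN ever shows up, the only thing that ends TIME_WAIT in
this model is the timer, which is why the timeout value matters so much.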

Since you say it's idle, it's obviously not receiving any FIN packets.
It's just sitting there... so you're not really waiting for any sort
of network events, and there are no network errors keeping it alive. It's
just sitting and waiting for the timeout when it knows that there's no
second FIN coming...

The default "time wait timeout" value is 120 seconds (2 mins) which
should be adequate for any situation. However, IBM provides the option
to change this, so you can make the timeout value higher or lower.

That's why my first suggestion to Carl was to prompt the CHGTCPA command
and see the value set there for the "time wait timeout".

However, on my box (V5R4) the maximum value for the timeout is 14400
seconds, which is only 4 hours. Carl's message suggests that TCP
connections are lingering for even more than 4 hours... which shouldn't
be possible. There's no way to misconfigure something in your network
settings to make it stay open for days.

That makes me think that something is "corrupt" in the software (IBM
supplied) that handles TCP/IP. If the timeout value is stored in an
integer, and somehow an extra bit got flipped in that integer, the
corruption could cause the timeout value to be a very long time indeed
-- and that would perfectly explain Carl's problem.
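
Some quick arithmetic shows why a single flipped bit would be enough. This
is only a sketch of the hypothesis: I don't know how the operating system
actually stores the timeout, so the plain binary integer here is an
assumption, and flip_bit and the chosen bit positions are just for
illustration.

```python
DEFAULT_TIMEOUT = 120      # seconds, the default time-wait timeout
MAX_CONFIGURABLE = 14400   # seconds, the largest value CHGTCPA accepts on V5R4
SECONDS_PER_DAY = 86400


def flip_bit(value, bit):
    """Return value with one bit inverted (bit 0 is the lowest bit)."""
    return value ^ (1 << bit)


for bit in (14, 20, 24):
    corrupted = flip_bit(DEFAULT_TIMEOUT, bit)
    days = corrupted / SECONDS_PER_DAY
    over = corrupted > MAX_CONFIGURABLE
    print(f"bit {bit}: {corrupted} seconds = {days:.2f} days, "
          f"beyond configurable maximum: {over}")
```

Flipping bit 20 alone turns 120 seconds into 1048696 seconds, which is
more than 12 days and far beyond anything you could set with CHGTCPA.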

So my second suggestion was to IPL the computer. (Though, actually
ending/restarting TCPIP should work just as well). This would reset
the software values, and therefore hopefully fix the problem.

If that didn't fix the problem, it suggests that the corruption isn't
just in main storage, but is also in auxiliary storage, since it's been
reloaded after an IPL. So in that case, you'd want to run a RCLSTG to
fix the problem. The programs that IBM provides as part of the
operating system ARE stored on disk, after all. As are the
configuration values.

I *have* run into this problem before. The last time I ran into it, an
IPL fixed the problem.