Hello Jerry,
One of the aggravating things about communications programs is that
there's so much that can go wrong in a communications link. It's nearly
impossible to devise a testing process that creates every possible
failure, because communications errors are so hard to replicate in a
test environment!
So you have to code VERY defensively. Assume everything can and will go
wrong, and be ready for it.
> For some reason this second instance may do one of two things and I'm not
> sure why. Over the course of 24 hours I've seen up to six connections
> established from the one client. My first question is why would
> connections other than the first one be initiated, when the client doesn't
> show any disconnect at their end?
All sorts of things can cause this. You should never assume that the
remote site will properly disconnect. There may be a temporary network
error of some sort that causes the remote side to see the connection
being reset, while your side doesn't see it.
This is compounded by the fact that many security auditors force you to
block ICMP messages (the messages that might report a routing error or
something like that).
The point is... never just assume that a connection is sane. Your
program should have timeouts implemented on every send/recv, as well as
some sort of "keepalive" that sends data periodically to/from the client
to see if it's still a valid connection. That way, if you get no
response for a long enough time, you'll know something is wrong, and can
disconnect and reset the connection.
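To make that concrete, here's a minimal sketch in C terms (the same
select() and recv() APIs that the RPG prototypes map to; the function
name and the timeout values are made up for illustration):

    #include <sys/time.h>
    #include <sys/select.h>
    #include <sys/socket.h>

    /* Wait up to 'seconds' for data, then recv() it.  Returns >0 when
       data arrives, 0 if the peer closed, -1 on error, -2 on timeout. */
    int recv_with_timeout(int sock, char *buf, int len, int seconds)
    {
        fd_set readset;
        struct timeval to;

        FD_ZERO(&readset);
        FD_SET(sock, &readset);
        to.tv_sec  = seconds;
        to.tv_usec = 0;

        int rc = select(sock + 1, &readset, NULL, NULL, &to);
        if (rc < 0)  return -1;     /* select() itself failed   */
        if (rc == 0) return -2;     /* nothing arrived in time  */

        return recv(sock, buf, len, 0);
    }

If that -2 comes back often enough with no heartbeat from the client in
between, you treat the connection as dead instead of waiting forever.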
> The second problem is that in viewing netstat *cnn for the client, after a
> random period of valid activity, typically in hours, I've monitored and
> seen where the Idle Time for the client is being reset to zero after every
> 90 seconds. Something is happening causing the idle time to reset itself
> every 90 seconds.
Probably a keepalive at the TCP level. Your program would not receive
any data, but some "under the covers" data would be sent at a regular
interval.
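If you want control over that for your own sockets, TCP keepalive is
just a socket option. A quick sketch in C terms (RPG calls the same
setsockopt() API through a prototype; the probe timing itself comes from
the TCP stack's configuration, not from this call):

    #include <stdio.h>
    #include <sys/socket.h>

    /* Turn on TCP-level keepalive for an already-connected socket. */
    int enable_keepalive(int sock)
    {
        int on = 1;

        if (setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE,
                       (char *)&on, sizeof(on)) < 0) {
            perror("setsockopt(SO_KEEPALIVE)");
            return -1;
        }
        return 0;
    }

Either end of the connection can have this turned on, so the 90-second
traffic you're seeing may well be coming from the client's stack.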
> When I look at the call stack of the socket server it is
> at the statement in my program that performs the "eval rc = select(max+1:
> %addr(readset): %addr(writeset): %addr(excpset): to)", and the job status
> is SELW. If I look in the joblog I can see when I last received valid data
> (which was received more than 90 seconds ago). So it appears that
> something is trying to get through but nothing's happening. At this point
> the only option is to force the program to end and restart it.
Why is that the only option?! Your program should have a means of
detecting this and it should reset itself.
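For example, something along these lines, building on the
recv_with_timeout() sketch above (the 120-second limit and the names are
arbitrary; pick whatever your application protocol can tolerate):

    #include <time.h>
    #include <unistd.h>

    int recv_with_timeout(int sock, char *buf, int len, int seconds);

    #define IDLE_LIMIT 120    /* give up after this many silent seconds */

    void serve_connection(int sock)
    {
        char   buf[512];
        time_t last_data = time(NULL);

        for (;;) {
            int rc = recv_with_timeout(sock, buf, sizeof(buf), 30);

            if (rc > 0) {
                last_data = time(NULL);   /* real data; reset the clock */
                /* ... process buf, send replies/heartbeats, etc. ...   */
            }
            else if (rc == 0 || rc == -1) {
                break;                    /* peer closed, or hard error */
            }
            else if (time(NULL) - last_data > IDLE_LIMIT) {
                break;                    /* silent too long; it's dead */
            }
        }

        close(sock);   /* then loop back to accept() and start fresh */
    }

No operator intervention needed: the program notices the dead air, drops
the connection, and goes back to waiting for a new one.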
I realize that getting all of this stuff right is a tough job.
Communications programming isn't easy -- but it's fun.
> I am stumped. How would I go about diagnosing the server socket just to be
> sure the error isn't happening with my socket server?
What good would it do to diagnose it? You might find out that a VPN or
router or something sent out a bad packet. So what? You need to expect
things to go wrong. For some reason the connection died, and your
program needs to be able to detect that and recover.
It doesn't really matter WHY it failed. What would knowing the reason change?
Assuming that the Internet is involved, you can't control all of the
places that a TCP/IP packet goes through. Knowing that some 3rd party
has a flaky piece of equipment isn't really very useful.
What's useful is expecting that things go wrong sometimes, and having a
graceful way to recover from it.
> I'm 99% sure the problem lies with the client since they've told me
> that they have other facilities that experience similar connectivity
> issues.
That's probably true. But, on the other hand, shouldn't your server be
ready to handle these connectivity problems? It should be able to
gracefully deal with any sort of failure.
Otherwise, anyone with the inclination can cause your server to fail by
deliberately sending a bad packet. (That's called a "denial of service"
attack, BTW.)