Hello Jerry,
One of the aggravating things about communications programs is that
there's so much that can go wrong in a communications link. It's nearly
impossible to devise a testing process that creates every possible
failure, because communications errors are so hard to replicate in a
test environment!
So you have to code VERY defensively. Assume everything can and will go
wrong, and be ready for it.
> For some reason this second instance may do one of two things and I'm not
> sure why. Over the course of 24 hours I've seen up to six connections
> established from the one client. My first question is why would
> connections other than the first one be initiated, when the client doesn't
> show any disconnect at their end?
All sorts of things can cause this. You should never assume that the
remote site will properly disconnect. There may be a temporary network
error of some sort that causes the remote side to see the connection
being reset, while your side doesn't see it.
This is compounded by the fact that many security auditors force you to
block ICMP messages (the messages that might report a routing error or
something like that).
The point is... never just assume that a connection is sane. Your
program should have timeouts implemented on every send/recv, as well as
some sort of "keepalive" that sends data periodically to/from the client
to see if it's still a valid connection. That way, if you get no
response for a long enough time, you'll know something is wrong, and can
disconnect and reset the connection.
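To make that concrete, here's a minimal sketch in C terms (the same
select() and recv() APIs that the RPG prototypes map to; the function
name and the timeout values are made up for illustration):

    #include <sys/time.h>
    #include <sys/select.h>
    #include <sys/socket.h>

    /* Wait up to 'seconds' for data, then recv() it.  Returns >0 when
       data arrives, 0 if the peer closed, -1 on error, -2 on timeout. */
    int recv_with_timeout(int sock, char *buf, int len, int seconds)
    {
        fd_set readset;
        struct timeval to;

        FD_ZERO(&readset);
        FD_SET(sock, &readset);
        to.tv_sec  = seconds;
        to.tv_usec = 0;

        int rc = select(sock + 1, &readset, NULL, NULL, &to);
        if (rc < 0)  return -1;     /* select() itself failed   */
        if (rc == 0) return -2;     /* nothing arrived in time  */

        return recv(sock, buf, len, 0);
    }

If that -2 comes back often enough with no heartbeat from the client in
between, you treat the connection as dead instead of waiting forever.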
> The second problem is that in viewing netstat *cnn for the client, after a
> random period of valid activity, typically in hours, I've monitored and
> seen where the Idle Time for the client is being reset to zero after every
> 90 seconds. Something is happening causing the idle time to reset itself
> every 90 seconds.
Probably a keepalive at the TCP level. Your program would not receive
any data, but some "under the covers" data would be sent at a regular
interval.
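If you want control over that for your own sockets, TCP keepalive is
just a socket option. A quick sketch in C terms (RPG calls the same
setsockopt() API through a prototype; the probe timing itself comes from
the TCP stack's configuration, not from this call):

    #include <stdio.h>
    #include <sys/socket.h>

    /* Turn on TCP-level keepalive for an already-connected socket. */
    int enable_keepalive(int sock)
    {
        int on = 1;

        if (setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE,
                       (char *)&on, sizeof(on)) < 0) {
            perror("setsockopt(SO_KEEPALIVE)");
            return -1;
        }
        return 0;
    }

Either end of the connection can have this turned on, so the 90-second
traffic you're seeing may well be coming from the client's stack.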
> When I look at the call stack of the socket server it is
> at the statement in my program that performs the "eval rc = select(max+1:
> %addr(readset): %addr(writeset): %addr(excpset): to)", and the job status
> is SELW. If I look in the joblog I can see when I last received valid data
> (which was received more than 90 seconds ago). So it appears that
> something is trying to get through but nothing's happening. At this point
> the only option is to force the program to end and restart it.
Why is that the only option?! Your program should have a means of
detecting this and it should reset itself.
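For example, something along these lines, building on the
recv_with_timeout() sketch above (the 120-second limit and the names are
arbitrary; pick whatever your application protocol can tolerate):

    #include <time.h>
    #include <unistd.h>

    int recv_with_timeout(int sock, char *buf, int len, int seconds);

    #define IDLE_LIMIT 120    /* give up after this many silent seconds */

    void serve_connection(int sock)
    {
        char   buf[512];
        time_t last_data = time(NULL);

        for (;;) {
            int rc = recv_with_timeout(sock, buf, sizeof(buf), 30);

            if (rc > 0) {
                last_data = time(NULL);   /* real data; reset the clock */
                /* ... process buf, send replies/heartbeats, etc. ...   */
            }
            else if (rc == 0 || rc == -1) {
                break;                    /* peer closed, or hard error */
            }
            else if (time(NULL) - last_data > IDLE_LIMIT) {
                break;                    /* silent too long; it's dead */
            }
        }

        close(sock);   /* then loop back to accept() and start fresh */
    }

No operator intervention needed: the program notices the dead air, drops
the connection, and goes back to waiting for a new one.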
I realize that getting all of this stuff right is a tough job.
Communications programming isn't easy -- but it's fun.
> I am stumped. How would I go about diagnosing the server socket just to be
> sure the error isn't happening with my socket server?
What good would it do to diagnose it? You might find out that a VPN or
router or something sent out a bad packet. So what? You need to expect
things to go wrong. For some reason the connection died, and your
program needs to be able to detect that and recover.
It doesn't really matter WHY it failed. What would knowing the reason change?
Assuming that the Internet is involved, you can't control all of the
places that a TCP/IP packet goes through. Knowing that some 3rd party
has a flaky piece of equipment isn't really very useful.
What's useful is expecting that things go wrong sometimes, and having a
graceful way to recover from it.
> I'm 99% sure the problem lies with the client since they've told me
> that they have other facilities that experience similar connectivity
> issues.
That's probably true. But, on the other hand, shouldn't your server be
ready to handle these connectivity problems? It should be able to
gracefully deal with any sort of failure.
Otherwise, anyone with the inclination can cause your server to fail by
deliberately sending a bad packet. (That's called a "denial of service"
attack, BTW.)