I am a user of a cluster where I test my distributed program, which consists of several computing "worker" programs running on different nodes. The workers communicate with one another over TCP on port 9900. I execute the workers remotely using ssh:

OpenSSH_3.4p1, SSH protocols 1.5/2.0, OpenSSL 0x0090602f

The administrator recently downgraded ssh to 3.4p1 after 3.5p1 also exhibited the same problem. The cluster is running Rocks v2.2, Linux kernel 2.4.18-27.7.xsmp.

I noticed that some (not all) of the nodes start having problems after I run my program on them. When I try to ssh to those nodes, ssh freezes. After a while (usually half a day to a day), ssh to the affected nodes instead returns the message:

ssh_exchange_identification: Connection closed by remote host

Thank you in advance for any help.

Andrew
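P.S. For reference, this is roughly how I launch the workers; the paths and node names below are placeholders, not my exact setup:

    # start one worker per node over ssh; each worker listens on TCP 9900
    for node in compute-0-10 compute-0-11 compute-0-14; do
        ssh "$node" "/home/andrew/bin/worker --port 9900" &
    done
    wait   # return once every remote worker has exited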
Created attachment 313 [details]
ssh -vvv compute-0-14

This is what is printed to the screen before ssh hangs when I run "ssh -vvv compute-0-14", where compute-0-14 is the name of an affected node in the cluster.
Can you keep non-ssh TCP connections up for similar periods? I.e., have you ruled out network-level issues? You might also want to set "ClientAliveInterval=120" in sshd_config to work around this.
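Something like this, for example (restart sshd afterwards; the ClientAliveCountMax line is optional, and the node name is only illustrative):

    # /etc/ssh/sshd_config on the affected nodes
    ClientAliveInterval 120    # probe the client every 120s over the ssh channel
    ClientAliveCountMax 3      # tear down the session after 3 unanswered probes

A quick non-ssh TCP check from the head node, e.g. against your port 9900:

    echo ping | nc -w 5 compute-0-14 9900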
Thank you for your help. The problem was actually caused by remote file access through NFS.

Remote execution of my program on a compute node causes some files to be updated. I was not aware that the filesystem where these files reside is actually mounted through NFS. It seems that after a remote execution of my program finishes and ssh returns, sshd on the remote side still has to deal with the outstanding NFS file access, which gets stuck and in turn leaves sshd stuck. I have not encountered a similar problem since I removed the remote file access from my program.

Andrew
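P.S. In case it is useful to others, this is how I confirmed the files were on NFS (the path is only an example):

    # show the filesystem type backing the directory my program writes to
    df -T /home/andrew/results
    # or list all NFS mounts on the node
    grep nfs /proc/mounts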
Mass change of RESOLVED bugs to CLOSED