Bug 578 - SSH freezes on cluster machine
Summary: SSH freezes on cluster machine
Status: CLOSED FIXED
Alias: None
Product: Portable OpenSSH
Classification: Unclassified
Component: ssh
Version: -current
Hardware: ix86 Linux
Importance: P2 normal
Assignee: OpenSSH Bugzilla mailing list
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2003-05-28 01:17 AEST by Andrew E. Santosa
Modified: 2004-04-14 12:24 AEST

See Also:


Attachments
ssh -vvv compute-0-14 (1.45 KB, text/plain)
2003-05-28 01:25 AEST, Andrew E. Santosa

Description Andrew E. Santosa 2003-05-28 01:17:48 AEST
I am a user of a cluster where I test my distributed program, which consists
of several computing "worker" programs running on different nodes. The
workers communicate with one another over TCP on port 9900. I execute
the workers remotely using ssh:

OpenSSH_3.4p1, SSH protocols 1.5/2.0, OpenSSL 0x0090602f

The administrator has recently downgraded ssh to 3.4p1 after 3.5p1
also exhibited the same problem.

The cluster is running Rocks v2.2, Linux kernel 2.4.18-27.7.xsmp

I noticed that some (not all) of the nodes start having problems after I run
my program on them. When I try to ssh to those nodes, ssh freezes.
After a while (usually half a day to a day), ssh to the affected nodes
instead returns the message

ssh_exchange_identification: Connection closed by remote host

Thank you in advance for any help.


Andrew
Comment 1 Andrew E. Santosa 2003-05-28 01:25:37 AEST
Created attachment 313
ssh -vvv compute-0-14

This is what is printed on the screen before ssh hangs
when I run

ssh -vvv compute-0-14

where compute-0-14 is the name of an affected node in the cluster.
Comment 2 Damien Miller 2003-06-04 19:26:09 AEST
Can you keep non-ssh TCP connections up for similar periods? I.e., have you
ruled out network-level issues?

You might also want to set "ClientAliveInterval=120" in sshd_config to work
around these.
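
One way to separate a network-level problem from a stuck sshd is to open a plain TCP connection to the SSH port and see whether the server's identification line ever arrives. The sketch below is only an illustration and is not part of OpenSSH. The function name `probe_ssh_banner`, the default port 22, and the 5-second timeout are assumptions. It returns only the first line received, which is normally the "SSH-..." version string.

```python
import socket


def probe_ssh_banner(host, port=22, timeout=5.0):
    """Connect over TCP and try to read the server's first line.

    Returns a (status, detail) tuple where status is one of:
      "banner"  - the server sent data; detail is the first line
      "closed"  - the server accepted and closed without sending anything
      "timeout" - connect or read did not finish within `timeout` seconds
      "error"   - any other socket error; detail is the error text
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            data = b""
            # Read until a newline arrives, the peer closes, or 255 bytes.
            while b"\n" not in data and len(data) < 255:
                chunk = sock.recv(255 - len(data))
                if not chunk:
                    break
                data += chunk
    except socket.timeout:
        return ("timeout", "")
    except OSError as exc:
        return ("error", str(exc))
    if not data:
        return ("closed", "")
    first_line = data.split(b"\n", 1)[0]
    return ("banner", first_line.decode("ascii", "replace").strip())


if __name__ == "__main__":
    # compute-0-14 is the affected node named in this report.
    print(probe_ssh_banner("compute-0-14"))
```

A "timeout" result would roughly match the freeze described in this report, and "closed" would roughly match the "Connection closed by remote host" message. A "banner" result suggests that both the network path and sshd's initial handshake are working.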
Comment 3 Andrew E. Santosa 2003-06-04 22:04:06 AEST
Thank you for your help. 

The problem was actually caused by remote file access through NFS.
Remote execution of my program on a compute node causes some files
to be updated. I was not aware that the filesystem where these files
reside is actually mounted through NFS.

It seems that after a remote execution of my program finishes
and ssh returns, sshd on the remote side still has to deal with
the remote file access, which is somehow stuck, and this makes ssh hang.
I have not encountered a similar problem since I removed the remote file
access from my program.

Andrew
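
Since the root cause was a file that turned out to live on an NFS mount, a quick check of which filesystem a path belongs to can help catch this earlier. The following is a minimal sketch, assuming a Linux node where the mount table is available as text in the /proc/mounts format. The helper names `mount_for_path` and `is_nfs` are hypothetical. It does not resolve symlinks and does not decode escaped characters (such as \040 for a space) in mount points.

```python
import os


def mount_for_path(path, mounts_text):
    """Return (mount_point, fs_type) for the longest mount point that
    contains `path`, or None if no entry matches.

    `mounts_text` is the mount table in /proc/mounts format:
    device, mount point, filesystem type, options, ...
    """
    path = os.path.abspath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # ">=" lets a later entry for the same mount point win,
            # since a later mount shadows an earlier one.
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type)
    return best


def is_nfs(path, mounts_text):
    """True if the filesystem type for `path` starts with "nfs"."""
    found = mount_for_path(path, mounts_text)
    return found is not None and found[1].startswith("nfs")


if __name__ == "__main__":
    with open("/proc/mounts") as f:
        table = f.read()
    # The path here is only an example; substitute a file the program writes.
    print(mount_for_path("/home", table), is_nfs("/home", table))
```

Passing the mount table in as text keeps the lookup independent of the machine it runs on, so the same function can be applied to a mount table copied from a compute node.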
Comment 4 Damien Miller 2004-04-14 12:24:19 AEST
Mass change of RESOLVED bugs to CLOSED