Bug 1398

Summary: slave ssh sessions enter a never-ending blocking state
Product: Portable OpenSSH
Reporter: Greg Shively <gregory_shively>
Component: ssh
Assignee: Assigned to nobody <unassigned-bugs>
Status: CLOSED FIXED    
Severity: normal
CC: djm
Priority: P2    
Version: 4.7p1   
Hardware: All   
OS: All   
Bug Depends on:    
Bug Blocks: 1353    
Attachments:
Description                                                        Flags
Patch to close client_fd                                           none
A quick regression to test bug                                     none
Allows a multiplex slave to exit and generate a true exit value    none
Cleanup duplicate diff hunks.                                      none

Description Greg Shively 2007-12-08 11:18:28 AEDT
Created attachment 1388: Patch to close client_fd

I'm currently looking at increasing MAX_SESSIONS to raise the number of slave ssh processes that can be multiplexed, and I ran into the default maximum filehandle limit on a test machine (Solaris 8). I found a set of patches in the CVS repo that is similar to a dirty patch I had come up with, so I've been implementing the patch from the repo. The patch includes clientloop.c@1.182, monitor_fdpass.h@1.4 and monitor_fdpass.c@1.13.

The problem I've hit is that in the cleanup-code for a failed mm_receive_fd() in the client_process_control() function, the client_fd filehandle is left open and lost. The effect is that the slave ssh process blocks and never returns even if filehandles are freed due to other slave processes closing. I've attached a patch that I think fixes this problem.
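The cleanup issue described above can be sketched in isolation. This is not the OpenSSH code or the attached patch: handle_control_request() and fake_receive_fd() are hypothetical stand-ins for client_process_control() and mm_receive_fd(), and the failure is simulated rather than caused by a real descriptor limit. The sketch only shows the pattern of closing the control connection on the error path so that the descriptor is not leaked.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical stand-in for mm_receive_fd(): returns a new descriptor,
 * or -1 when asked to simulate a failure. */
static int fake_receive_fd(int sock, int fail)
{
    (void)sock;
    if (fail)
        return -1;
    return open("/dev/null", O_RDONLY);
}

/* Hypothetical stand-in for the control-request handler.
 * fail_at is the index (0..2) of the receive that should fail, or -1
 * for no failure. Returns 0 on success and leaves client_fd open for
 * the caller; returns -1 on failure after closing every descriptor it
 * handled, including client_fd. */
int handle_control_request(int client_fd, int fail_at)
{
    int fds[3] = { -1, -1, -1 };
    int i;

    for (i = 0; i < 3; i++) {
        fds[i] = fake_receive_fd(client_fd, i == fail_at);
        if (fds[i] == -1)
            goto fail;
    }

    /* A real handler would hand these to a session; the sketch just
     * releases them again. */
    for (i = 0; i < 3; i++)
        close(fds[i]);
    return 0;

fail:
    for (i = 0; i < 3; i++)
        if (fds[i] != -1)
            close(fds[i]);
    close(client_fd);   /* the step the report says was missing */
    return -1;
}

/* Returns 1 if fd refers to an open descriptor, 0 otherwise. */
int fd_is_open(int fd)
{
    return fcntl(fd, F_GETFD) != -1;
}
```

Without the close(client_fd) on the failure path, the peer on the other end of the control socket would presumably never see the connection go away, which matches the blocking behaviour described in the report.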

I've also created a simple regression test, but I'm not sure how well it will work elsewhere. To test the issue manually:

in one window/session:
  ( ulimit -Sn 11 ; exec ./ssh -vMS  /tmp/cntl otherhost ) 

in another window/session:
  ./ssh -vS  /tmp/cntl otherhost

The process in the 2nd window blocks until the master ssh process exits. I would think it would be better to have the slave exit as soon as possible since it will never be able to access otherhost.

I've also seen another interesting effect. I've been testing on Solaris 8 and SLES 10 machines, and it only seems to affect the Solaris machine: if the filehandle limit is hit by the accept() call in the same client_process_control() function, the slave ssh session blocks until another slave ends and frees some filehandles. I can reproduce this manually by changing the previous ulimit value to 15 and running a third process in the same way as the 2nd. The 3rd process blocks, but once the 2nd process exits, the 3rd is let in. I couldn't reproduce this on a Linux machine, and I think this is the "right" thing to do anyway.
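For readers unfamiliar with how the filehandle limit shows up to a process, here is a minimal sketch. It assumes a POSIX system; errno_at_fd_limit() is a hypothetical helper, not part of OpenSSH. It lowers the soft RLIMIT_NOFILE the way `ulimit -Sn 11` does, opens descriptors until a call fails, and reports the resulting errno. On POSIX systems accept() fails with the same EMFILE error when the per-process limit is reached.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical helper: lower the soft descriptor limit to soft_limit,
 * open /dev/null until open() fails, then restore the old limit.
 * Returns the errno of the failing open(), 0 if no open() failed
 * within 64 attempts, or -1 if the limit could not be changed. */
int errno_at_fd_limit(rlim_t soft_limit)
{
    struct rlimit old, lowered;
    int fds[64];
    int n = 0, saved_errno = 0, i;

    if (getrlimit(RLIMIT_NOFILE, &old) == -1)
        return -1;
    lowered = old;
    lowered.rlim_cur = soft_limit;
    if (setrlimit(RLIMIT_NOFILE, &lowered) == -1)
        return -1;

    while (n < 64) {
        int fd = open("/dev/null", O_RDONLY);
        if (fd == -1) {
            saved_errno = errno;   /* expected: EMFILE */
            break;
        }
        fds[n++] = fd;
    }

    /* Release everything and put the original soft limit back. */
    for (i = 0; i < n; i++)
        close(fds[i]);
    setrlimit(RLIMIT_NOFILE, &old);
    return saved_errno;
}
```

Calling errno_at_fd_limit(11) mirrors the limit used in the manual test above. The point is that the failure is an ordinary error return that the caller has to handle and clean up after.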
Comment 1 Greg Shively 2007-12-08 11:21:12 AEDT
Created attachment 1389: A quick regression to test bug
Comment 2 Greg Shively 2007-12-12 11:27:44 AEDT
Comment on attachment 1389: A quick regression to test bug

Regression failed on a different machine.
Comment 3 Greg Shively 2007-12-19 08:54:46 AEDT
Created attachment 1398: Allows a multiplex slave to exit and generate a true exit value
Comment 4 Greg Shively 2007-12-20 07:59:27 AEDT
Created attachment 1399: Cleanup duplicate diff hunks.

Found some more equivalent changes in the CVS repo and removed the duplicate diff hunks.
Comment 5 Damien Miller 2008-01-20 07:51:46 AEDT
Patch applied - thanks!
Comment 6 Damien Miller 2008-03-31 15:22:51 AEDT
Fix shipped in 4.9/4.9p1 release.