Bug 1398 - slave ssh sessions enter a never-ending blocking state
Summary: slave ssh sessions enter a never-ending blocking state
Status: CLOSED FIXED
Alias: None
Product: Portable OpenSSH
Classification: Unclassified
Component: ssh
Version: 4.7p1
Hardware: All All
Importance: P2 normal
Assignee: Assigned to nobody
URL:
Keywords:
Depends on:
Blocks: V_4_8
Reported: 2007-12-08 11:18 AEDT by Greg Shively
Modified: 2008-03-31 15:22 AEDT

See Also:


Attachments
Patch to close client_fd (206 bytes, patch)
2007-12-08 11:18 AEDT, Greg Shively
A quick regression to test bug (1.69 KB, application/octet-stream)
2007-12-08 11:21 AEDT, Greg Shively
Allows a multiplex slave to exit and generate a true exit value (1.04 KB, patch)
2007-12-19 08:54 AEDT, Greg Shively
Cleanup duplicate diff hunks. (470 bytes, patch)
2007-12-20 07:59 AEDT, Greg Shively

Description Greg Shively 2007-12-08 11:18:28 AEDT
Created attachment 1388
Patch to close client_fd

I'm currently looking at increasing MAX_SESSIONS to raise the number of slave ssh processes that can be multiplexed, and in doing so I ran into the default maximum filehandle limit on a test machine (Solaris 8). I found a set of patches in the CVS repos similar to a dirty patch that I came up with, so I've been implementing the patch from the repos. The patch includes clientloop.c@1.182, monitor_fdpass.h@1.4 and monitor_fdpass.c@1.13.

The problem I've hit is that in the cleanup code for a failed mm_receive_fd() in the client_process_control() function, the client_fd filehandle is left open and lost. The effect is that the slave ssh process blocks and never returns, even if filehandles are freed by other slave processes closing. I've attached a patch that I think fixes this problem.
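
To illustrate the error path described above, here is a minimal standalone sketch. It is not the OpenSSH code and not the attached patch. The names handle_control_client(), recv_fd_fn, recv_fd_always_fails() and peer_sees_eof() are hypothetical, and the assumption that three descriptors are received per connection is mine. The point it shows is only that closing client_fd on failure lets the peer's read() return EOF instead of blocking.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical callback type standing in for a descriptor-passing
 * receiver such as mm_receive_fd(): returns a new fd, or -1 on failure. */
typedef int (*recv_fd_fn)(int sock);

/* Test double: always fails, as if the process had run out of descriptors. */
int recv_fd_always_fails(int sock)
{
    (void)sock;
    errno = EMFILE;
    return -1;
}

/*
 * Sketch of a control-connection handler. It tries to receive three
 * descriptors (an assumption made for this example). On any failure it
 * closes what it already received AND client_fd, so the peer sees EOF.
 * Returns 0 on success (the caller then owns client_fd and fds[]),
 * -1 on failure.
 */
int handle_control_client(int client_fd, recv_fd_fn recv_fd, int fds[3])
{
    int i, j;

    assert(recv_fd != NULL && fds != NULL);
    for (i = 0; i < 3; i++) {
        fds[i] = recv_fd(client_fd);
        if (fds[i] == -1) {
            for (j = 0; j < i; j++)
                close(fds[j]);
            /* If this close is skipped, the peer keeps waiting until
             * this process exits. */
            close(client_fd);
            return -1;
        }
    }
    return 0;
}

/* Returns 1 if a read on fd reports EOF (peer closed), 0 otherwise. */
int peer_sees_eof(int fd)
{
    char c;

    return read(fd, &c, 1) == 0;
}
```

With a socketpair() standing in for the control socket, calling handle_control_client() on one end with the failing receiver and then peer_sees_eof() on the other end returns immediately. Without the close(client_fd) line, that read would block.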

I've also created a simple regression test, but I'm not sure how well it will work in other environments. To test the issue manually:

in one window/session:
  ( ulimit -Sn 11 ; exec ./ssh -vMS  /tmp/cntl otherhost ) 

in another window/session:
  ./ssh -vS  /tmp/cntl otherhost

The process in the 2nd window blocks until the master ssh process exits. I would think it would be better to have the slave exit as soon as possible since it will never be able to access otherhost.

I've also seen another interesting effect while testing this. I'm currently testing on Solaris 8 and SLES 10 machines, and it only seems to affect the Solaris machine: if the filehandle limit is hit by the accept() call in the same client_process_control() function, the slave ssh session blocks until another slave ends, freeing some filehandles. I can reproduce this manually by changing the previous ulimit value to 15 and running a third process in the same way as the 2nd. The 3rd process blocks, but once the 2nd process exits, the 3rd is let in. I couldn't reproduce this on a Linux machine, and I think this is the "right" thing to do anyway.
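
One plausible mechanism for the accept() behaviour, offered as an assumption rather than a confirmed diagnosis of ssh itself: when accept() fails because the descriptor table is full, the pending connection can stay in the listen queue and be accepted later once a descriptor is freed. The standalone sketch below demonstrates that with a Unix-domain socket. The function accept_under_fd_limit(), the struct accept_demo_result and the socket path passed in are all hypothetical names for this example. I have only reasoned about this on Linux, and other systems may differ.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

#define MAX_FILL 256

struct accept_demo_result {
    int first_errno; /* errno from accept() while the fd table is full */
    int second_ok;   /* 1 if accept() succeeded after one close() */
};

/* Returns 0 if the demonstration ran to completion, -1 otherwise. */
int accept_under_fd_limit(const char *path, struct accept_demo_result *res)
{
    struct sockaddr_un addr;
    struct rlimit saved, lowered;
    int filler[MAX_FILL];
    int listen_fd = -1, client_fd = -1, fd, n = 0, rc = -1;

    assert(path != NULL && res != NULL);
    res->first_errno = 0;
    res->second_ok = 0;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
    unlink(path);

    listen_fd = socket(AF_UNIX, SOCK_STREAM, 0);
    client_fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (listen_fd == -1 || client_fd == -1)
        goto out;
    /* The connection is queued on the listener before accept() runs. */
    if (bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr)) == -1 ||
        listen(listen_fd, 5) == -1 ||
        connect(client_fd, (struct sockaddr *)&addr, sizeof(addr)) == -1)
        goto out;

    /* Lower the soft limit (like `ulimit -Sn`), then fill every free slot. */
    if (getrlimit(RLIMIT_NOFILE, &saved) == -1)
        goto out;
    lowered = saved;
    if ((rlim_t)(client_fd + 8) < lowered.rlim_cur)
        lowered.rlim_cur = (rlim_t)(client_fd + 8);
    if (setrlimit(RLIMIT_NOFILE, &lowered) == -1)
        goto out;
    while (n < MAX_FILL && (fd = dup(listen_fd)) != -1)
        filler[n++] = fd;
    if (n == 0 || n == MAX_FILL)
        goto restore;

    /* First attempt: no descriptor is available for the new connection. */
    fd = accept(listen_fd, NULL, NULL);
    if (fd != -1) {
        close(fd); /* unexpected: the table was not actually full */
        goto restore;
    }
    res->first_errno = errno;

    /* Free exactly one descriptor, as a finished slave would, and retry. */
    close(filler[--n]);
    fd = accept(listen_fd, NULL, NULL);
    if (fd != -1) {
        res->second_ok = 1;
        close(fd);
    }
    rc = 0;

restore:
    while (n > 0)
        close(filler[--n]);
    setrlimit(RLIMIT_NOFILE, &saved);
out:
    if (client_fd != -1)
        close(client_fd);
    if (listen_fd != -1)
        close(listen_fd);
    unlink(path);
    return rc;
}
```

The connecting side in this sketch plays the role of the 3rd process: its connect() succeeds, but nothing services it until a descriptor is released.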
Comment 1 Greg Shively 2007-12-08 11:21:12 AEDT
Created attachment 1389
A quick regression to test bug
Comment 2 Greg Shively 2007-12-12 11:27:44 AEDT
Comment on attachment 1389
A quick regression to test bug

Regression failed on a different machine.
Comment 3 Greg Shively 2007-12-19 08:54:46 AEDT
Created attachment 1398
Allows a multiplex slave to exit and generate a true exit value
Comment 4 Greg Shively 2007-12-20 07:59:27 AEDT
Created attachment 1399
Cleanup duplicate diff hunks.

Found some more equivalent changes in CVS repos. Removed the duplicate diff hunks.
Comment 5 Damien Miller 2008-01-20 07:51:46 AEDT
Patch applied - thanks!
Comment 6 Damien Miller 2008-03-31 15:22:51 AEDT
Fix shipped in 4.9/4.9p1 release.