I use connection sharing with Subversion; it was working well in the past, but for several days (after I updated OpenSSH, in particular), I've had many problems with it: ssh often freezes. I've tried with two servers, and I have the same problem with both. For the test, I've used the following wrapper: #!/usr/bin/env zsh source ~/.zshenv source ~/.zalias unset DISPLAY [[ $1 != ay || -e /tmp/ssh-ay.vinc17.org-22-lefevre ]] || (echo "Starting master" >&2; ssh -fMNv ay >& ~/master.log) echo "executing ssh" >&2 ssh -Cvvv "$@" Note: I use only the -v option for the master since I can't reproduce the bug with -vv or -vvv (perhaps I haven't tried hard). On the remote side (ay, which is a Debian testing/unstable machine with OpenSSH_4.3p2 Debian-5.1), I have a .ssh/rc file that contains: case "$OSTYPE" in solaris*) ip="`echo $SSH_CONNECTION | sed 's/ .*//'`";; *) ip="${SSH_CONNECTION%% *}";; esac : ${HOST:=`hostname`} echo "Connected to $HOST (from $ip)" >&2 if read proto cookie && [ -n "$DISPLAY" ]; then [ "$HOST" = prunille.vinc17.org ] && PATH=${PATH}:/usr/X11R6/bin if [ `echo $DISPLAY | cut -c1-10` = 'localhost:' ]; then echo add unix:`echo $DISPLAY | cut -c11-` $proto $cookie else echo add $DISPLAY $proto $cookie fi | xauth -q - fi I'm going to attach the master.log file and a file containing the two (identical) commands I've typed from the terminal on my Mac OS X machine (prunille), and you'll see the point where the second one freezes.
Created attachment 1209 [details] master log
Created attachment 1210 [details] typed commands and output
Can you replicate this with unpatched openssh-4.5 at each end? Debian seems to include quite a few patches that may be breaking things.
I forgot to say: after the freeze, I can typing Ctrl-\ to interrupt the svn command (Ctrl-C isn't always sufficient). This makes the master die after writing the following line to stderr (here, the master.log file): mm_receive_fd: recvmsg: expected received 1 got 0
Another thing to watch for: when the connection is in the hung state, is the "SendQ" column in the output of netstat corresponding to the ssh connection non-zero and increasing? If so that's indicative of a network problem (probably MTU).
(In reply to comment #5) > Another thing to watch for: when the connection is in the hung state, > is the "SendQ" column in the output of netstat corresponding to the > ssh connection non-zero and increasing? If so that's indicative of a > network problem (probably MTU). Forgot to add: you need to check this on both client and server.
The Send-Q column has only 0's on both the client and the server.
(In reply to comment #3) > Can you replicate this with unpatched openssh-4.5 at each end? Debian > seems to include quite a few patches that may be breaking things. I also get a freeze, but I don't get the same debug messages! prunille:~/wd> svn diff -c 14924 > /dev/null executing ssh OpenSSH_4.5p1, OpenSSL 0.9.8d 28 Sep 2006 debug1: Reading configuration data /Users/vinc17/.ssh/config debug1: /Users/vinc17/.ssh/config line 5: Deprecated option "FallBackToRsh" debug1: Applying options for ay debug1: Applying options for * When ssh doesn't freeze, the next debug line is: debug3: Normalising mapped IPv4 in IPv6 address The master.log still ends with: debug1: client_input_channel_req: channel 1 rtype exit-status reply 0 debug1: channel 1: free: client-session, nchannels 2 debug1: client_input_channel_req: channel 0 rtype exit-status reply 0 debug1: channel 0: free: client-session, nchannels 1 debug1: fd 26 clearing O_NONBLOCK and the server debug messages ended with: debug1: session_by_channel: session 0 channel 0 debug1: session_close_by_channel: channel 0 child 0 debug1: session_close: session 0 pid 0 debug2: channel 0: gc: user detached debug2: channel 0: is dead debug2: channel 0: garbage collecting debug1: channel 0: free: server-session, nchannels 1 debug3: channel 0: status: The following connections are open: #0 server-session (t4 r0 i3/0 o3/0 fd 9/9 cfd -1) debug3: channel 0: close_fds r 9 w 9 e -1 c -1 in the first test, and in a second test: debug1: server_input_channel_req: channel 0 request exec reply 0 debug1: session_by_channel: session 0 channel 0 debug1: session_input_channel_req: session 0 req exec debug2: fd 9 setting O_NONBLOCK debug3: fd 9 is O_NONBLOCK debug2: fd 11 setting O_NONBLOCK debug2: channel 0: read 413 from efd 11 debug2: channel 0: rwin 131072 elen 413 euse 1 debug2: channel 0: sent ext data 413 debug2: channel 0: read 35 from efd 11 debug2: channel 0: rwin 130659 elen 35 euse 1 debug2: channel 0: sent ext data 35
I've managed to produce a freeze with a master started with -vv (but it was difficult). Server debug messages (unpatched v4.5p1): [...] debug1: server_input_channel_req: channel 0 request exec reply 0 debug1: session_by_channel: session 0 channel 0 debug1: session_input_channel_req: session 0 req exec debug2: fd 9 setting O_NONBLOCK debug3: fd 9 is O_NONBLOCK debug2: fd 11 setting O_NONBLOCK debug2: channel 0: read 413 from efd 11 debug2: channel 0: rwin 131072 elen 413 euse 1 debug2: channel 0: sent ext data 413 debug2: channel 0: read 35 from efd 11 debug2: channel 0: rwin 130659 elen 35 euse 1 debug2: channel 0: sent ext data 35 In the terminal (where I typed svn): [...] executing ssh OpenSSH_4.5p1, OpenSSL 0.9.8d 28 Sep 2006 debug1: Reading configuration data /Users/vinc17/.ssh/config debug1: /Users/vinc17/.ssh/config line 5: Deprecated option "FallBackToRsh" debug1: Applying options for ay debug1: Applying options for * And the master.log file: debug1: channel 0: free: client-session, nchannels 1 debug1: fd 26 clearing O_NONBLOCK debug2: client_process_control: accepted tty 0, subsys 0, cmd svnserve -t debug2: client_process_control: got fds stdin 27, stdout 28, stderr 29 debug2: fd 27 setting O_NONBLOCK debug2: fd 28 setting O_NONBLOCK debug2: fd 26 setting O_NONBLOCK debug1: channel 0: new [client-session] debug2: channel 0: send open debug2: callback start debug2: client_session2_setup: id 0 debug1: Sending environment. debug1: Sending env LC_CTYPE = en_US.ISO8859-1 debug2: channel 0: request env confirm 0 debug1: Sending env LANG = POSIX debug2: channel 0: request env confirm 0 debug1: Sending env LC_COLLATE = POSIX debug2: channel 0: request env confirm 0 debug1: Sending command: svnserve -t debug2: channel 0: request exec confirm 0 debug2: callback done debug2: channel 0: open confirm rwindow 0 rmax 32768 debug2: channel 0: rcvd adjust 131072 debug2: channel 0: rcvd ext data 413 debug2: channel 0: written 413 to efd 29 debug2: channel 0: rcvd ext data 35 debug2: channel 0: written 35 to efd 29 debug1: fd 30 clearing O_NONBLOCK debug2: client_process_control: accepted tty 0, subsys 0, cmd svnserve -t
On the server when it's hung, can you do "strace -p [pidofsshd]" (the unprivileged one) and see what it's waiting on? The equivalent on the client would be great if it's available (for both master and slave) but I don't know what, if any, the equivalent might be on OS X. Also, I noticed that the descriptors mentioned on the client side are relatively high. Are you really running 5-15 sessions concurrently? If not there might be a descriptor leak in there somewhere (not sure if it's contributing to the problem you're seeing though).
When do the connections freeze? Is it at random, after a period of inactivity, during large transfers, etc.?
(In reply to comment #11) > When do the connections freeze? Is it at random, after a period of > inactivity, during large transfers, etc.? At random. The only parameter that seems to have an effect on the freezes is the debug level of the master: with -vv, it is difficult to reproduce a freeze, and with -vvv, I couldn't reproduce a single freeze. Note: I won't be able to do other tests before Friday.
Created attachment 1230 [details] Revert TCP_NODELAY patch Since you mentioned that it's only started happening recently, this patch is worth a try: it reverts the following change: - dtucker@cvs.openbsd.org 2006/03/13 08:33:00 [packet.c] Set TCP_NODELAY for all connections not just "interactive" ones. Fixes poor performance and protocol stalls under some network conditions (mindrot bugs #556 and #981). Patch originally from markus@, ok djm@ Also, you said that the trouble started when you upgraded OpenSSH: which version did you upgrade from, and was that on the client, server or both?
Created attachment 1231 [details] fd passing test program I *think* (i.e. I'm not sure) that I have seen this on OS X too, when opening subsequent connections on an existing control master. It seems to be getting stuck while passing file descriptors from the slave to the master (which suggests a kernel bug of some sort), however my attempt to make a test case (attached) never fails. Perhaps there is some subtlety around non-blocking file descriptors or running into ulimits?
(In reply to comment #13) > Also, you said that the trouble started when you upgraded OpenSSH: > which version did you upgrade from, and was that on the client, server > or both? If fact, I had tried to revert to an older version, and the freeze was still occurring. So, it could be due to another upgrade. I've done other tests without the patch to see if it still occurs, and I could reproduce it only once (the first time), whereas it was occurring much more often before. I don't know if this is related, but I also had freeze problems with the man command (the freeze occurred in close(), called from pclose()), and it seems to occur much less often. I wonder if the 2007-01-25 Mac OS X security update change anything concerning the freeze.
Since I wrote comment #14, I have definitely seen it a number of times on OSX. I'm pretty sure it is an OSX kernel bug: I have never seen it on any other platform, and the problem occurs inside a syscall. OTOH we could handle fd passing errors more gracefully.
I could reproduce the problem several times yesterday, even after recompiling ssh. Then I applied the patch. I couldn't reproduce the problem since, but I haven't tested it extensively.
I don't know if this is related, but I've just had a frozen connection under Linux (Debian/unstable, openssh-client 1:4.6p1-5): after 6 minutes, I had to do a Ctrl-C. I retried just after the Ctrl-C, and the command completed almost immediately.
*** Bug 1382 has been marked as a duplicate of this bug. ***
The hangs that intermittently occurred when setting up a new multiplexing client are mitigated somewhat in -current (future openssh-4.8) as we have made FD passing errors non-fatal, so it is possible to take down a stuck client without killing the master process as well.
Created attachment 1371 [details] script to call scp repeatedly as a ControlMaster slave Simple stress test script by EricKow <mzwcp6002@sneakemail.com> which reproduces the error.
Every time I ctrl-C a hanging client, the Darwin kernel log spits out: Oct 22 13:52:46 Albin kernel[0]: sbdrop - count not zero According to a Darwin-kernel post (http://lists.apple.com/archives/Darwin-kernel/2005/May/msg00011.html): > It means that, in the process of clearing out a sockbuf (dropping a > specified number of bytes), the mbuf chain ran out before the count > got to zero. It used to be a panic, but has been replaced by > printf/log. File a bug. It may be relevant.
Some good news: I have not been able to reproduce this bug (nor that kernel error message) on Leopard, at least not with the sshtester.pl script.
This problem still occurs at the start of attempts to transfer large files (CD & DVD iso, O(500MB) or O(3GB)) via scp on Mac OS X 10.4.11 with OpenSSH 4.7p1. It also occurs while trying to transfer a number of mid-sized files (video podcasts, O(25MB)) using rsync, but not scp. An easy workaround for Mac OS 10 10.4.11 is to use the MacPorts install of OpenSSH 5.0p1 (e.g.: 'sudo port install openssh' to get "openssh 5.0p1_0").
Thanks for confirming the bug is fixed in recent OpenSSH - I'll close this bug.
Mass update RESOLVED->CLOSED after release of openssh-5.1