Bug 1263 - connection sharing often freezes
Summary: connection sharing often freezes
Status: CLOSED FIXED
Alias: None
Product: Portable OpenSSH
Classification: Unclassified
Component: ssh (show other bugs)
Version: 4.5p1
Hardware: PPC Mac OS X
: P2 major
Assignee: Assigned to nobody
URL:
Keywords:
: 1382 (view as bug list)
Depends on:
Blocks:
 
Reported: 2006-11-20 12:57 AEDT by Vincent Lefevre
Modified: 2008-07-22 12:12 AEST (History)
5 users (show)

See Also:


Attachments
master log (2.85 KB, text/plain)
2006-11-20 12:58 AEDT, Vincent Lefevre
no flags Details
typed commands and output (1.88 KB, text/plain)
2006-11-20 12:59 AEDT, Vincent Lefevre
no flags Details
Revert TCP_NODELAY patch (602 bytes, patch)
2007-01-22 14:16 AEDT, Darren Tucker
no flags Details | Diff
fd passing test program (4.48 KB, text/plain)
2007-01-22 16:52 AEDT, Damien Miller
no flags Details
script to call scp repeatedly as a ControlMaster slave (1.11 KB, text/plain)
2007-10-26 10:58 AEST, Alexander Staubo
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Vincent Lefevre 2006-11-20 12:57:39 AEDT
I use connection sharing with Subversion; it was working well in the past, but for several days (after I updated OpenSSH, in particular), I've had many problems with it: ssh often freezes. I've tried with two servers, and I have the same problem with both. For the test, I've used the following wrapper:

#!/usr/bin/env zsh

source ~/.zshenv
source ~/.zalias
unset DISPLAY

[[ $1 != ay || -e /tmp/ssh-ay.vinc17.org-22-lefevre ]] ||
  (echo "Starting master" >&2; ssh -fMNv ay >& ~/master.log)

echo "executing ssh" >&2
ssh -Cvvv "$@"

Note: I use only the -v option for the master since I can't reproduce the bug with -vv or -vvv (perhaps I haven't tried hard).

On the remote side (ay, which is a Debian testing/unstable machine with OpenSSH_4.3p2 Debian-5.1), I have a .ssh/rc file that contains:

case "$OSTYPE" in
  solaris*)  ip="`echo $SSH_CONNECTION | sed 's/ .*//'`";;
  *)         ip="${SSH_CONNECTION%% *}";;
esac

: ${HOST:=`hostname`}
echo "Connected to $HOST (from $ip)" >&2

if read proto cookie && [ -n "$DISPLAY" ]; then
  [ "$HOST" = prunille.vinc17.org ] && PATH=${PATH}:/usr/X11R6/bin
  if [ `echo $DISPLAY | cut -c1-10` = 'localhost:' ]; then
    echo add unix:`echo $DISPLAY | cut -c11-` $proto $cookie
  else
    echo add $DISPLAY $proto $cookie
  fi | xauth -q -
fi

I'm going to attach the master.log file and a file containing the two (identical) commands I've typed from the terminal on my Mac OS X machine (prunille), and you'll see the point where the second one freezes.
Comment 1 Vincent Lefevre 2006-11-20 12:58:53 AEDT
Created attachment 1209 [details]
master log
Comment 2 Vincent Lefevre 2006-11-20 12:59:46 AEDT
Created attachment 1210 [details]
typed commands and output
Comment 3 Damien Miller 2006-11-20 13:04:18 AEDT
Can you replicate this with unpatched openssh-4.5 at each end? Debian seems to include quite a few patches that may be breaking things.
Comment 4 Vincent Lefevre 2006-11-20 13:09:05 AEDT
I forgot to say: after the freeze, I can typing Ctrl-\ to interrupt the svn command (Ctrl-C isn't always sufficient). This makes the master die after writing the following line to stderr (here, the master.log file):

mm_receive_fd: recvmsg: expected received 1 got 0
Comment 5 Darren Tucker 2006-11-20 13:13:26 AEDT
Another thing to watch for: when the connection is in the hung state, is the "SendQ" column in  the output of netstat corresponding to the ssh connection non-zero and increasing?  If so that's indicative of a network problem (probably MTU).
Comment 6 Darren Tucker 2006-11-20 13:15:37 AEDT
(In reply to comment #5)
> Another thing to watch for: when the connection is in the hung state,
> is the "SendQ" column in  the output of netstat corresponding to the
> ssh connection non-zero and increasing?  If so that's indicative of a
> network problem (probably MTU).

Forgot to add: you need to check this on both client and server.

Comment 7 Vincent Lefevre 2006-11-20 13:25:54 AEDT
The Send-Q column has only 0's on both the client and the server.
Comment 8 Vincent Lefevre 2006-11-20 14:13:13 AEDT
(In reply to comment #3)
> Can you replicate this with unpatched openssh-4.5 at each end? Debian
> seems to include quite a few patches that may be breaking things.

I also get a freeze, but I don't get the same debug messages!

prunille:~/wd> svn diff -c 14924 > /dev/null
executing ssh
OpenSSH_4.5p1, OpenSSL 0.9.8d 28 Sep 2006
debug1: Reading configuration data /Users/vinc17/.ssh/config
debug1: /Users/vinc17/.ssh/config line 5: Deprecated option "FallBackToRsh"
debug1: Applying options for ay
debug1: Applying options for *

When ssh doesn't freeze, the next debug line is:

debug3: Normalising mapped IPv4 in IPv6 address

The master.log still ends with:

debug1: client_input_channel_req: channel 1 rtype exit-status reply 0
debug1: channel 1: free: client-session, nchannels 2
debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
debug1: channel 0: free: client-session, nchannels 1
debug1: fd 26 clearing O_NONBLOCK

and the server debug messages ended with:

debug1: session_by_channel: session 0 channel 0
debug1: session_close_by_channel: channel 0 child 0
debug1: session_close: session 0 pid 0
debug2: channel 0: gc: user detached
debug2: channel 0: is dead
debug2: channel 0: garbage collecting
debug1: channel 0: free: server-session, nchannels 1
debug3: channel 0: status: The following connections are open:
  #0 server-session (t4 r0 i3/0 o3/0 fd 9/9 cfd -1)

debug3: channel 0: close_fds r 9 w 9 e -1 c -1

in the first test, and in a second test:

debug1: server_input_channel_req: channel 0 request exec reply 0
debug1: session_by_channel: session 0 channel 0
debug1: session_input_channel_req: session 0 req exec
debug2: fd 9 setting O_NONBLOCK
debug3: fd 9 is O_NONBLOCK
debug2: fd 11 setting O_NONBLOCK
debug2: channel 0: read 413 from efd 11
debug2: channel 0: rwin 131072 elen 413 euse 1
debug2: channel 0: sent ext data 413
debug2: channel 0: read 35 from efd 11
debug2: channel 0: rwin 130659 elen 35 euse 1
debug2: channel 0: sent ext data 35
Comment 9 Vincent Lefevre 2006-11-20 14:24:50 AEDT
I've managed to produce a freeze with a master started with -vv (but it was difficult).

Server debug messages (unpatched v4.5p1):
[...]
debug1: server_input_channel_req: channel 0 request exec reply 0
debug1: session_by_channel: session 0 channel 0
debug1: session_input_channel_req: session 0 req exec
debug2: fd 9 setting O_NONBLOCK
debug3: fd 9 is O_NONBLOCK
debug2: fd 11 setting O_NONBLOCK
debug2: channel 0: read 413 from efd 11
debug2: channel 0: rwin 131072 elen 413 euse 1
debug2: channel 0: sent ext data 413
debug2: channel 0: read 35 from efd 11
debug2: channel 0: rwin 130659 elen 35 euse 1
debug2: channel 0: sent ext data 35

In the terminal (where I typed svn):
[...]
executing ssh
OpenSSH_4.5p1, OpenSSL 0.9.8d 28 Sep 2006
debug1: Reading configuration data /Users/vinc17/.ssh/config
debug1: /Users/vinc17/.ssh/config line 5: Deprecated option "FallBackToRsh"
debug1: Applying options for ay
debug1: Applying options for *

And the master.log file:
debug1: channel 0: free: client-session, nchannels 1
debug1: fd 26 clearing O_NONBLOCK
debug2: client_process_control: accepted tty 0, subsys 0, cmd svnserve -t
debug2: client_process_control: got fds stdin 27, stdout 28, stderr 29
debug2: fd 27 setting O_NONBLOCK
debug2: fd 28 setting O_NONBLOCK
debug2: fd 26 setting O_NONBLOCK
debug1: channel 0: new [client-session]
debug2: channel 0: send open
debug2: callback start
debug2: client_session2_setup: id 0
debug1: Sending environment.
debug1: Sending env LC_CTYPE = en_US.ISO8859-1
debug2: channel 0: request env confirm 0
debug1: Sending env LANG = POSIX
debug2: channel 0: request env confirm 0
debug1: Sending env LC_COLLATE = POSIX
debug2: channel 0: request env confirm 0
debug1: Sending command: svnserve -t
debug2: channel 0: request exec confirm 0
debug2: callback done
debug2: channel 0: open confirm rwindow 0 rmax 32768
debug2: channel 0: rcvd adjust 131072
debug2: channel 0: rcvd ext data 413
debug2: channel 0: written 413 to efd 29
debug2: channel 0: rcvd ext data 35
debug2: channel 0: written 35 to efd 29
debug1: fd 30 clearing O_NONBLOCK
debug2: client_process_control: accepted tty 0, subsys 0, cmd svnserve -t
Comment 10 Darren Tucker 2006-11-20 22:42:55 AEDT
On the server when it's hung, can you do "strace -p [pidofsshd]" (the unprivileged one) and see what it's waiting on?  The equivalent on the client would be great if it's available (for both master and slave) but I don't know what, if any, the equivalent might be on OS X.

Also, I noticed that the descriptors mentioned on the client side are relatively high.  Are you really running 5-15 sessions concurrently?  If not there might be a descriptor leak in there somewhere (not sure if it's contributing to the problem you're seeing though).
Comment 11 Damien Miller 2007-01-22 12:48:51 AEDT
When do the connections freeze? Is it at random, after a period of inactivity, during large transfers, etc.?
Comment 12 Vincent Lefevre 2007-01-22 13:19:20 AEDT
(In reply to comment #11)
> When do the connections freeze? Is it at random, after a period of
> inactivity, during large transfers, etc.?

At random. The only parameter that seems to have an effect on the freezes is the debug level of the master: with -vv, it is difficult to reproduce a freeze, and with -vvv, I couldn't reproduce a single freeze.

Note: I won't be able to do other tests before Friday.
Comment 13 Darren Tucker 2007-01-22 14:16:02 AEDT
Created attachment 1230 [details]
Revert TCP_NODELAY patch

Since you mentioned that it's only started happening recently, this patch is worth a try: it reverts the following change:

- dtucker@cvs.openbsd.org 2006/03/13 08:33:00
[packet.c]
Set TCP_NODELAY for all connections not just "interactive" ones.  Fixes
poor performance and protocol stalls under some network conditions (mindrot bugs #556 and #981). Patch originally from markus@, ok djm@

Also, you said that the trouble started when you upgraded OpenSSH: which version did you upgrade from, and was that on the client, server or both?
Comment 14 Damien Miller 2007-01-22 16:52:12 AEDT
Created attachment 1231 [details]
fd passing test program

I *think* (i.e. I'm not sure) that I have seen this on OS X too, when opening subsequent connections on an existing control master. It seems to be getting stuck while passing file descriptors from the slave to the master (which suggests a kernel bug of some sort), however my attempt to make a test case (attached) never fails.

Perhaps there is some subtlety around non-blocking file descriptors or running into ulimits?
Comment 15 Vincent Lefevre 2007-02-16 03:15:57 AEDT
(In reply to comment #13)
> Also, you said that the trouble started when you upgraded OpenSSH:
> which version did you upgrade from, and was that on the client, server
> or both?

If fact, I had tried to revert to an older version, and the freeze was still occurring. So, it could be due to another upgrade. I've done other tests without the patch to see if it still occurs, and I could reproduce it only once (the first time), whereas it was occurring much more often before.

I don't know if this is related, but I also had freeze problems with the man command (the freeze occurred in close(), called from pclose()), and it seems to occur much less often.

I wonder if the 2007-01-25 Mac OS X security update change anything concerning the freeze.
Comment 16 Damien Miller 2007-02-16 16:11:16 AEDT
Since I wrote comment #14, I have definitely seen it a number of times on OSX. I'm pretty sure it is an OSX kernel bug: I have never seen it on any other platform, and the problem occurs inside a syscall.

OTOH we could handle fd passing errors more gracefully.
Comment 17 Vincent Lefevre 2007-02-16 18:29:21 AEDT
I could reproduce the problem several times yesterday, even after recompiling ssh. Then I applied the patch. I couldn't reproduce the problem since, but I haven't tested it extensively.
Comment 18 Vincent Lefevre 2007-08-17 12:02:42 AEST
I don't know if this is related, but I've just had a frozen connection under Linux (Debian/unstable, openssh-client 1:4.6p1-5): after 6 minutes, I had to do a Ctrl-C. I retried just after the Ctrl-C, and the command completed almost immediately.
Comment 19 Damien Miller 2007-10-26 10:41:13 AEST
*** Bug 1382 has been marked as a duplicate of this bug. ***
Comment 20 Damien Miller 2007-10-26 10:44:21 AEST
The hangs that intermittently occurred when setting up a new multiplexing client are mitigated somewhat in -current (future openssh-4.8) as we have made FD passing errors non-fatal, so it is possible to take down a stuck client without killing the master process as well.
Comment 21 Alexander Staubo 2007-10-26 10:58:45 AEST
Created attachment 1371 [details]
script to call scp repeatedly as a ControlMaster slave

Simple stress test script by EricKow <mzwcp6002@sneakemail.com> which reproduces the error.
Comment 22 Alexander Staubo 2007-10-26 11:01:37 AEST
Every time I ctrl-C a hanging client, the Darwin kernel log spits out:

 Oct 22 13:52:46 Albin kernel[0]: sbdrop - count not zero

According to a Darwin-kernel post (http://lists.apple.com/archives/Darwin-kernel/2005/May/msg00011.html):

> It means that, in the process of clearing out a sockbuf (dropping a
> specified number of bytes), the mbuf chain ran out before the count
> got to zero. It used to be a panic, but has been replaced by
> printf/log. File a bug.

It may be relevant.
Comment 23 Alexander Staubo 2007-10-30 10:09:35 AEDT
Some good news: I have not been able to reproduce this bug (nor that kernel error message) on Leopard, at least not with the sshtester.pl script.
Comment 24 rdump 2008-05-14 08:50:51 AEST
This problem still occurs at the start of attempts to transfer large files (CD & DVD iso, O(500MB) or O(3GB)) via scp on Mac OS X 10.4.11 with OpenSSH 4.7p1.  It also occurs while trying to transfer a number of mid-sized files (video podcasts, O(25MB)) using rsync, but not scp.

An easy workaround for Mac OS 10 10.4.11 is to use the MacPorts install of OpenSSH 5.0p1 (e.g.: 'sudo port install openssh' to get "openssh 5.0p1_0").
Comment 25 Damien Miller 2008-05-15 10:53:16 AEST
Thanks for confirming the bug is fixed in recent OpenSSH - I'll close this bug.
Comment 26 Damien Miller 2008-07-22 12:12:19 AEST
Mass update RESOLVED->CLOSED after release of openssh-5.1