Bug 1208 - ssh fails to remove control socket when ssh is abnormally terminated
Summary: ssh fails to remove control socket when ssh is abnormally terminated
Status: CLOSED FIXED
Alias: None
Product: Portable OpenSSH
Classification: Unclassified
Component: ssh (show other bugs)
Version: 5.0p1
Hardware: All All
: P2 normal
Assignee: Damien Miller
URL: http://subversion.tigris.org/issues/s...
Keywords:
Depends on: 1329
Blocks: V_5_1
  Show dependency treegraph
 
Reported: 2006-07-10 13:36 AEST by Daniel Black
Modified: 2008-07-22 12:08 AEST (History)
2 users (show)

See Also:


Attachments
debug logs + version info (5.33 KB, text/plain)
2006-07-10 13:38 AEST, Daniel Black
no flags Details
debug logs + version info (without hpn patch) (4.83 KB, text/plain)
2006-07-10 14:31 AEST, Daniel Black
no flags Details
svn.strace (35.92 KB, text/plain)
2006-07-10 14:59 AEST, Daniel Black
no flags Details
svn.strace.12827 (ssh process 1) (30.94 KB, text/plain)
2006-07-10 15:00 AEST, Daniel Black
no flags Details
svn.strace.12837 (ssh process 2) (38.20 KB, text/plain)
2006-07-10 15:03 AEST, Daniel Black
no flags Details
subversion patch (823 bytes, patch)
2006-07-10 16:03 AEST, Damien Miller
no flags Details | Diff
Delay start of listen on multiplex control socket (794 bytes, patch)
2007-06-22 13:50 AEST, Damien Miller
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Daniel Black 2006-07-10 13:36:50 AEST
Using svn over ssh creates a mastersocket as expected however it doesn't remove it when completed.

The next svn command finds the socket however no process is attached to it.
Comment 1 Daniel Black 2006-07-10 13:38:01 AEST
Created attachment 1159 [details]
debug logs + version info
Comment 2 Damien Miller 2006-07-10 13:59:48 AEST
You seem to be using a modified "hpn" SSH client. Can you reproduce this with an unpatched OpenSSH?
Comment 3 Daniel Black 2006-07-10 14:31:15 AEST
Created attachment 1160 [details]
debug logs + version info (without hpn patch)

yes it is reproduceable without the hpn patch.
Comment 4 Damien Miller 2006-07-10 14:37:35 AEST
it looks like svn might be ungracefully killing its ssh process. Could you try running it under strace "strace -ffo svn.strace svn up"? 

This should create a couple of svn.strace.$PID files, one of which should contain ssh information.
Comment 5 Daniel Black 2006-07-10 14:59:29 AEST
Created attachment 1161 [details]
svn.strace

I think your right - subversion process - line 594
Comment 6 Daniel Black 2006-07-10 15:00:40 AEST
Created attachment 1162 [details]
svn.strace.12827 (ssh process 1)
Comment 7 Daniel Black 2006-07-10 15:03:30 AEST
Created attachment 1163 [details]
svn.strace.12837 (ssh process 2)

I read somewhere on the subversion bug records that two transports is normal.

cvs-ssh works ok for me and so does normal ssh shell with ControlMaster so I'll put a subversion bug in soon.

Thanks for your time.
Comment 9 Damien Miller 2006-07-10 15:36:33 AEST
Yes, it looks like svn is using SIGKILL to "tidy up" its ssh transport. 

That is definitely a bug in svn: shutting a transport down using SIGKILL gives the transport no opportunity to clean up after itself. There is nothing that OpenSSH can do to remove the socket in this case as SIGKILL cannot be caught.

BTW Debian bug 313371 is unrelated.

Comment 10 Darren Tucker 2006-07-10 15:39:26 AEST
(In reply to comment #8)
> Cross references
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=313371

That's an excellent example of throwing the baby out with the bathwater: it would appear that svn is using SIGKILL simply because they want the "terminated by signal" messages, but if that's the only reason then all they had to do was use "ssh -q"...
Comment 11 Darren Tucker 2006-07-10 15:44:23 AEST
(In reply to comment #10)
> they want the "terminated by signal" messages,

Doh.  Make that "they *don't* want" ...

Still, as Damien says there's nothing ssh can do to clean up if svn is SIGKILL'ing it.

Maybe ssh could recover more gracefully from a stale control socket?
Comment 12 Damien Miller 2006-07-10 15:48:44 AEST
It seems like the Subversion developers have known about the problem for some time, but obviously haven't fixed it:

http://svn.haxx.se/dev/archive-2006-03/1020.shtml
http://svn.haxx.se/users/archive-2005-11/0194.shtml
Comment 13 Damien Miller 2006-07-10 16:03:41 AEST
Created attachment 1164 [details]
subversion patch

Try this patch to subversion. It changes its broken behaviour of SIGKILL'ing ssh while avoiding printing the termination message by setting ssh's "-q" flag. (completely untested)

If it works, please feed it back to the subversion developers.
Comment 14 Daniel Black 2006-07-10 18:21:29 AEST
(In reply to comment #10)
Thanks Damien. Patch works beautifully. Tested well.

As per URL I've issued the subversion folks with a bug.

I've attached comparative straces on the subversion bug report.

 
(In reply to comment #10)
> Maybe ssh could recover more gracefully from a stale control socket?
yes probably. I'm sure subversion isn't the only thing that could kill a ssh client.

(In reply to comment #9)
> BTW Debian bug 313371 is unrelated.
Here is Peter basically saying that ssh is overly verbose on a couple of types of signals and provide a patch to make the obvious ones silent.

Thanks Damian and Darren, much appreciate your help.
Comment 15 Daniel Black 2006-07-28 15:22:52 AEST
better summary as per comment #11
Comment 16 Björn Jacke 2007-01-17 22:54:58 AEDT
another place where a stale control socket is left behind is when a "LocalCommand" hangs (like in bug #1232) and the user has to kill the ssh client during connection setup by ctrl-c here.
Comment 17 Damien Miller 2007-06-22 13:50:53 AEST
Created attachment 1309 [details]
Delay start of listen on multiplex control socket

In response to comment #16:

This patch should fix your problem - can you verify?
Comment 18 Daniel Black 2007-06-22 18:25:03 AEST
Not enough sorry.
The below example used OpenSSH_4.6p1 with the attached patch
A ssh connection was made (and a control socket was created). The ssh connection was killed with a kill -9 from another terminal. An attempt to reconnect was made and this failed as per below:

localhost $ ssh devgentoo
Enter passphrase for key '/home/dan/.ssh/id_dsa':
Last login: Fri Jun 22 08:05:10 2007 from 59.167.43.249
HP ProLiant DL380 G5 Server
dragonheart@woodpecker ~ $ Killed

localhost ~ $ ssh devgentoo
Control socket connect(/home/dan/.ssh/master-dragonheart@dev.gentoo.org:22): Connection refused
Enter passphrase for key '/home/dan/.ssh/id_dsa':
ControlSocket /home/dan/.ssh/master-dragonheart@dev.gentoo.org:22 already exists

localhost ~ $ ssh -v
OpenSSH_4.6p1, OpenSSL 0.9.8d 28 Sep 2006

Recommendation:
a) delete the control socket when a connection refused occurs (the connection obviously isn't much use anyway)

I'm not sure of all the things that could cause a connection refused but given its local I don't think there would be too many options.

b) create a different socket name if you get a "already exists" error like maketemp though a) would be easier.
Comment 19 Darren Tucker 2007-06-22 19:35:29 AEST
(In reply to comment #18)
> Recommendation:
> a) delete the control socket when a connection refused occurs (the
> connection obviously isn't much use anyway)

I don't think that's a the right thing to do.  You'll get a connection refused if the agent has reached its listen backlog limit (eg if you have a bunch of clients attempting to access it all in quick succession) but in that case it's just a temporary condition.
Comment 20 Daniel Black 2007-06-22 21:40:18 AEST
ah thought there must of been a reason you didn't delete it.

How about making the 'already exists' on an attempted reconnection a warning and fall back to a new connection?
Comment 21 Damien Miller 2008-06-12 16:55:42 AEST
something similar to attachment #1309 [details] has been committed, and we are looking at implementing fallback to TCP connection on mux client errors.
Comment 22 Damien Miller 2008-06-13 10:19:18 AEST
the patch in attachment #1309 [details] has been applied applied; will be in openssh-5.1
Comment 23 Damien Miller 2008-07-22 12:08:44 AEST
Mass update RESOLVED->CLOSED after release of openssh-5.1