Bug 1633 - Race condition in ssh-agent AUTH_CONNECTION
Status: CLOSED FIXED
Alias: None
Product: Portable OpenSSH
Classification: Unclassified
Component: ssh-agent
Version: 5.2p1
Hardware: ix86 Linux
Importance: P2 normal
Assignee: Assigned to nobody
URL:
Keywords: patch
Duplicates: 1135
Depends on: 1254
Blocks: V_5_4
Reported: 2009-08-19 06:27 AEST by noodle10000
Modified: 2010-03-26 10:51 AEDT

Attachments
fall back to select() on read/write interruptions (1.60 KB, patch)
2009-08-19 06:37 AEST, Damien Miller
fix the root cause of the problem too (1.94 KB, patch)
2009-08-19 06:50 AEST, Damien Miller

Description noodle10000 2009-08-19 06:27:26 AEST
I have the same issue as encountered in bug 1254.  When launching thousands of SSH connections via a script (the open source taktuk/kanif) using ssh-agent to forward keys, occasionally I will see ssh-agent hang and consume 100% of one CPU.  This does not happen every time, but around 1 out of every 3 runs.

I have compiled 5.2p1, which exhibits the same issue. strace at the time of the hang reports an EAGAIN error on a read call. A few printfs isolated the code in question to be the same as mentioned in bug 1254, but the suggested workaround (adding a usleep before trying the read again) does not work for me.

This issue is also reported at http://www.plug.org/pipermail/plug/2009-April/033800.html



+++ This bug was initially created as a clone of Bug #1254 +++

In function after_select(), case AUTH_CONNECTION, the do-loop which
handles socket reads will peg my CPU at close to 100% when errno is
EAGAIN.

I'm running FreeBSD 6.2 pre-release, with OpenSSH built from the ports
collection (security/openssh-portable).

The problem only occurs for me while running an automation script that
sends commands through ssh to about a hundred servers at a time, and I
have not been successful in identifying which server causes the
problem.  But the bottom line is that the read fails with errno EAGAIN,
and continues to fail in a very tight loop until a timeout occurs at
some point.

My work-around was to introduce a tiny sleep before the continue
statement in that loop, which is apparently enough to allow some data
to become available for reading, and makes the problem go away.

I will attach my work-around as a patch, realizing that usleep() is
probably not available on all platforms.
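
For illustration only, here is a minimal sketch of the alternative to sleeping inside the loop: attempt the read once and, on EAGAIN or EINTR, hand control back to the caller so it can wait in select() again instead of spinning. This is not code from ssh-agent.c; the names read_once(), set_nonblock() and the IO_* result codes are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch; not names from ssh-agent.c. */
enum { IO_EOF = 0, IO_FATAL = -1, IO_RETRY_LATER = -2 };

/*
 * Try a single read on a non-blocking descriptor.
 * Returns the byte count (> 0), IO_EOF, IO_FATAL, or IO_RETRY_LATER.
 * On EAGAIN/EINTR we deliberately do not loop: the caller is expected
 * to return to select() and retry once the descriptor is readable.
 */
long
read_once(int fd, void *buf, size_t len)
{
	ssize_t n;

	assert(buf != NULL && len > 0);
	n = read(fd, buf, len);
	if (n > 0)
		return (long)n;
	if (n == 0)
		return IO_EOF;
	if (errno == EAGAIN || errno == EINTR)
		return IO_RETRY_LATER;
	return IO_FATAL;
}

/* Demo helper: put a descriptor into non-blocking mode. */
int
set_nonblock(int fd)
{
	int flags = fcntl(fd, F_GETFL, 0);

	if (flags == -1)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}
```

With this shape, an empty non-blocking pipe yields IO_RETRY_LATER immediately rather than a busy loop, and no platform-specific sleep call is needed.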
Comment 1 Damien Miller 2009-08-19 06:37:05 AEST
Created attachment 1670
fall back to select() on read/write interruptions

Could you try to reproduce the problem with this patch applied?
Comment 2 Damien Miller 2009-08-19 06:46:27 AEST
... and here is a theory on how it occurs:

on a heavily loaded ssh-agent, we can create a new socket in the ssh-agent.c:after_select() loop, via the AUTH_SOCKET case calling new_socket(). This might increase sockets_alloc past the value it had when execution enters after_select().

The for() loop in after_select() can therefore progress into sockets that did not exist when select() and, critically, prepare_select() were called. prepare_select() sizes and clears the fd_sets that select() subsequently populates and after_select() tests.

So a new AUTH_CONNECTION socket whose creation increments sockets_alloc can cause after_select() to test past the end of the allocated fd_sets and, depending on what it finds there, treat the new sockets as ready for reading.
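
A simplified model of that hazard and one common way to avoid it: snapshot the table size before the loop, so entries added during the loop are never checked against readiness data that was sized for the old table. This is only a sketch and does not claim to reproduce the attached patches; struct slot_table, add_connection(), process_ready() and the plain byte array standing in for the fd_sets are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SLOTS 16

enum { SLOT_UNUSED = 0, SLOT_LISTENER = 1, SLOT_CONNECTION = 2 };

/* Hypothetical stand-in for the agent's socket table. */
struct slot_table {
	int    kind[MAX_SLOTS];
	size_t alloc;		/* slots in use; plays the role of sockets_alloc */
};

/* Append a connection slot, as accepting on a listener would. */
int
add_connection(struct slot_table *t)
{
	if (t->alloc >= MAX_SLOTS)
		return -1;
	t->kind[t->alloc++] = SLOT_CONNECTION;
	return 0;
}

/*
 * Walk the table using readiness flags that were computed for the
 * first ready_len slots only. The loop bound is captured before the
 * loop starts, so slots added while iterating are left for the next
 * pass instead of being tested against stale (or out-of-range) flags.
 * Returns the number of ready slots handled.
 */
size_t
process_ready(struct slot_table *t, const unsigned char *ready,
    size_t ready_len)
{
	size_t i, handled = 0;
	size_t orig_alloc = t->alloc;	/* snapshot before the table can grow */

	assert(ready_len >= orig_alloc);
	for (i = 0; i < orig_alloc; i++) {
		if (!ready[i])
			continue;
		if (t->kind[i] == SLOT_LISTENER)
			(void)add_connection(t);	/* grows t->alloc only */
		handled++;
	}
	return handled;
}
```

Had the loop been bounded by t->alloc instead of orig_alloc, it would have gone on to index ready[] at positions that were never sized or filled for this pass.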
Comment 3 Damien Miller 2009-08-19 06:50:43 AEST
Created attachment 1671
fix the root cause of the problem too
Comment 4 noodle10000 2009-08-19 07:14:46 AEST
Patch applied to the ssh-agent.c in openssh-5.2p1 (RCS revision 1.159).  I have now successfully run our scripts against 6000 hosts for the first time, so it appears to have solved the issue.  

I will be soak-testing over the next 48 hours and will update after that. 

(and thanks for the very quick response!)
Comment 5 Damien Miller 2009-08-27 03:30:12 AEST
Have you been able to reproduce the problem with patch #1671 applied?
Comment 6 noodle10000 2009-08-27 19:38:03 AEST
(In reply to comment #5)
> Have you been able to reproduce the problem with patch #1671 applied?

We've not had any further problems with ssh-agent since applying #1671 - looks like it's fixed.  Thanks!
Comment 7 Damien Miller 2009-09-02 00:43:47 AEST
patch applied. This will be in openssh-5.4.
Comment 8 Damien Miller 2009-10-06 15:02:33 AEDT
Mass move of RESOLVED bugs to CLOSED now that 5.3 is out.
Comment 9 Damien Miller 2009-11-20 10:41:44 AEDT
*** Bug 1135 has been marked as a duplicate of this bug. ***
Comment 10 Darren Tucker 2010-03-26 10:51:18 AEDT
With the release of 5.4p1, this bug is now considered closed.