| Summary: | sshd gets stuck: select() in packet_read_seqnr waits indefinitely | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Portable OpenSSH | Reporter: | Matt Day <openssh> | ||||||||||
| Component: | sshd | Assignee: | Damien Miller <djm> | ||||||||||
| Status: | CLOSED FIXED | ||||||||||||
| Severity: | major | CC: | djm, dtucker, t8m | ||||||||||
| Priority: | P2 | Keywords: | patch | ||||||||||
| Version: | 4.2p1 | ||||||||||||
| Hardware: | All | ||||||||||||
| OS: | All | ||||||||||||
| URL: | http://marc.info/?t=117394251600035 | ||||||||||||
| Bug Depends on: | |||||||||||||
| Bug Blocks: | 1452 | ||||||||||||
| Attachments: |
|
||||||||||||
|
Description
Matt Day
2007-09-17 13:02:25 AEST
Comment on attachment 1348 [details] latest version of fix -- this has been tested >+ while ((ret = select(connection_in + 1, setp, NULL, NULL, >+ packet_wait_tvp)) == -1 && I think you need to reset packet_wait_tv before each select call. Linux at least will modify the timeout parameter to the contain the amount of time remaining, and apparently this behaviour is permitted by POSIX. (In reply to comment #1) > I think you need to reset packet_wait_tv before each select call. Linux > at least will modify the timeout parameter to the contain the amount of > time remaining, and apparently this behaviour is permitted by POSIX. Good observation -- I agree. In addition to the patch, this sort of select-loop with timeout appears in the following OpenSSH 4.7 places: * conloop() (ssh-keyscan.c) * timeout_connect (sshconnect.c) I'm not familiar with that code, but must not be portable for the same reason. In the patch, the purpose of the select-loop is to wait for the socket to become ready. In OpenSSH 4.7, the packet.c routines wait forever. So the patch introduces a timeout which is enabled on systems using the SSH keepalive. So, I think the patch is actually *correct* on Linux. On systems like FreeBSD that do not change the select-timeval, the select-loop would start the timer over from the beginning each time EAGAIN or EINTR occurred. So if signals kept going off, sshd could still get stuck indefinitely in that loop. So I believe the correct fix should keep track of its own time and pass the correct amount of remaining time period into select(). > In addition to the patch, this sort of select-loop with timeout appears > in the following OpenSSH 4.7 places: > * conloop() (ssh-keyscan.c) I'm not so fussed about this one. > * timeout_connect (sshconnect.c) This is fixed in CVS -current. See: http://www.openbsd.org/cgi-bin/cvsweb/src/usr.bin/ssh/sshconnect.c.diff?r1=1.201&r2=1.202 > So, I think the patch is actually *correct* on Linux. On systems like > FreeBSD that do not change the select-timeval, the select-loop would > start the timer over from the beginning each time EAGAIN or EINTR > occurred. So if signals kept going off, sshd could still get stuck > indefinitely in that loop. Actually, the patch is incorrect everywhere and possibly badly broken on Linux. On Linux, the timeout will get smaller and smaller and will eventually timeout an active connection. On other platforms the behavior is undefined. The logic that needs to be implemented is: 1. Decrease the timeout on EINTR or EAGAIN and retry the select 2. Reset the timeout on a successful read (In reply to comment #3) > Actually, the patch is incorrect everywhere and possibly badly broken > on Linux. On Linux, the timeout will get smaller and smaller and will > eventually timeout an active connection. On other platforms the > behavior is undefined. Ah, you're right of course. I forgot the timeout value is being stored in a global by packet_set_timeout, which is only being called at session establishment time. Created attachment 1350 [details] revised patch This implements the logic I described in comment #3 Shouldn't the ms_to_timeval(&timeout, ms_remain); be called inside the for(;;) loop? As you said the contents of the timeout is undefined on some platforms after returning from the select(). Created attachment 1351 [details]
Set remain_ms correctly
You are right: that diff only reset the timeout on successful reads, but not on interrupted select()s.
One concern though: does this impose a timelimit of (ClientAliveInterval * ClientAliveCountMax) on userauth?
I'm not sure whether that is desirable, because we already have LoginGraceTime for that. I'm not sure what the solution is there, perhaps set the timeout to MAX(LoginGraceTime, ClientAliveInterval * ClientAliveCountMax), turn the timeout off after KEX, or never turn it on in the server to begin with?
(In reply to comment #7) I think the solution is not to enable the timeout in the server at all - it already has LoginGraceTime that kicks off right after accept(). So, just ignore the sshd.c hunk of this patch. (In reply to comment #8) > (In reply to comment #7) > I think the solution is not to enable the timeout in the server at all > - it already has LoginGraceTime that kicks off right after accept(). > > So, just ignore the sshd.c hunk of this patch. The problem also exists in the session (eg during rekey) and LoginGraceTime doesn't help there. We could set the timeout to MAX(ClientAliveInterval * ClientAliveCountMax, LoginGraceTime) before auth, and ClientAliveInterval * ClientAliveCountMax afterward. Created attachment 1446 [details]
activate server-side timeout after auth completes
Almost identical patch, but activate the sshd timeout right after the LoginGraceTime alarm is cancelled. I.e. LoginGraceTime will cover initial KEX and auth, and this new timeout will cover the rest of the running protocol (including re-KEX).
client behaviour is unchanged
Thanks all. This has been applied and will be in the 5.1 release as well as tomorrow's snapshop. Mass update RESOLVED->CLOSED after release of openssh-5.1 |