Scenario:

1. Set up a local socket server that sends data slowly enough so that buffers would take hours to fill up:

   $ (until false; do echo -n X; sleep 2; done) | nc -l 8000 &

2. Connect through an unreliable connection, asking to detect a broken connection within 10 seconds (5 second "alive" signals, 2 missing maximum):

   $ ssh -R 8001:127.0.0.1:8000 \
         -o 'ServerAliveInterval 5' -o 'ServerAliveCountMax 2' \
         -o 'ProxyCommand nc 127.0.0.1 22' \
         127.0.0.1 'telnet 127.0.0.1 8001'

   (this assumes you can ssh into localhost using either a password or public key authentication)

3. Observe that indeed, you are getting 'X' printed every 2 seconds, through the ssh tunnel.

4. Suspend the intermediate proxy - in another terminal / screen session (or after backgrounding the ssh command above), do:

   $ pkill -STOP -xf 'nc 127.0.0.1 22'

5. Wait 10 seconds for ServerAlive detection to kick in. Or 10 hours. ServerAlive detection never actually kicks in.

6. Tear down everything (it is enough to Ctrl-C the ssh command)

7. Repeat steps 1-5, this time, with 'sleep 2' replaced by 'sleep 30'. This time, ServerAlive detection kicks in as expected.

This happens on every openssh version I've tried (all on Linux: the versions on Ubuntu 8.04, 10.04, 10.10, 12.04, 14.04), and, from browsing the source code, it is still present in the current version.

The problem is the "ServerAlive" logic (and I assume, also the ClientAlive logic on the server side - though I haven't verified that yet): A connection is deemed "alive" if the select() waiting for data did not time out. However, it should be deemed alive only if there has been data on the ssh connection itself - not the local ends of a -L / -R tunnel and whatever other local sockets might be waited upon with select().

As the above example shows, even though the connection to the server is effectively dead, it will not be detected.

This setup is artificial, and is easier to debug than a real world setting.
It includes:

- the ssh server
- an intermediate pipe ('nc 127.0.0.1 22') that can be kill -STOPped without dropping the connection
- the ssh client
- a slow server that trickles data through a tunnel

In a real world scenario, the intermediate pipe is likely to be an unreliable network connection (e.g. an intermediate router somewhere along the way that is not directly connected to a client interface - and that stops routing traffic in the middle of the session). If this is the case, then eventually the ssh client will have a TCP timeout (2 mins, usually) and detect the broken connection -- which is why I suppose this was not previously reported. However, if there is no indication the intermediate connection died (like in the example I gave above), then the ssh client will hang forever, despite the "ServerAlive*" settings.

As I mentioned, this likely applies to the sshd, ClientAliveInterval, ClientAliveCountMax respectively, though I haven't verified it.
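To make the reported logic concrete, here is a minimal sketch of the two policies described above. It is a model only, not OpenSSH's code: struct alive_state, probe_due_flawed() and probe_due_fixed() are hypothetical names, and the readiness flags stand in for what select() would report for a forwarded local socket and for the ssh connection socket.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical state for illustration; not OpenSSH's actual data structures. */
struct alive_state {
	time_t deadline;	/* when the next keepalive probe is due */
	int interval;		/* seconds, in the role of ServerAliveInterval */
};

/*
 * Policy as described in the report: any select() wakeup counts as proof
 * of life, so a busy forwarded socket keeps pushing the deadline out.
 */
bool
probe_due_flawed(struct alive_state *s, time_t now, bool local_fd_ready,
    bool conn_fd_ready)
{
	if (local_fd_ready || conn_fd_ready) {
		s->deadline = now + s->interval;
		return false;
	}
	return now >= s->deadline;
}

/*
 * Policy the report asks for: only data arriving on the ssh connection
 * itself resets the deadline; local tunnel activity is ignored.
 */
bool
probe_due_fixed(struct alive_state *s, time_t now, bool local_fd_ready,
    bool conn_fd_ready)
{
	(void)local_fd_ready;	/* deliberately not treated as liveness */
	if (conn_fd_ready) {
		s->deadline = now + s->interval;
		return false;
	}
	return now >= s->deadline;
}
```

With a 5 second interval, a dead connection and local data every 2 seconds, the first policy never reports a probe as due, while the second does once the deadline passes. With local data only every 30 seconds, both do, which matches step 7.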
Note that in some circumstances this can be leveraged into a denial-of-service attack: if an attacker is able to disconnect a remote connection and feed data locally at the same time, they can keep the dead connection from being detected, so no new data comes in. (I found this out while investigating what looked like a DoS but eventually wasn't.)
The patch sent to the mailing list here: https://lists.mindrot.org/pipermail/openssh-unix-dev/2020-May/038522.html ...will fix this issue. However, the patch is currently in limbo, neither accepted nor rejected.
Created attachment 3417
ServerAliveInterval doesn't work if client keeps trying to send data

Patch in question for commenting.
Comment on attachment 3417
ServerAliveInterval doesn't work if client keeps trying to send data

Looks mostly ok, there's a couple of long lines and one comment:

>+	timeout_secs = server_alive_time - now;
>+	if (timeout_secs < 0)
>+		timeout_secs = 0;

This can be a MAXIMUM(..) which is shorter and consistent with the rest of the code.

I'll attach an updated patch shortly.
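For readers following the review, the MAXIMUM(..) suggestion amounts to a one-line clamp. The sketch below assumes a conventional two-argument max macro; the macro definition shown here and the wrapper function alive_timeout_secs() are illustrative assumptions, not text from the patch.

```c
#include <time.h>

/* Assumed shape of the macro referred to in the review; defined only if absent. */
#ifndef MAXIMUM
#define MAXIMUM(a, b) (((a) > (b)) ? (a) : (b))
#endif

/*
 * Hypothetical wrapper: seconds until the next ServerAlive check,
 * clamped so an overdue check yields 0 rather than a negative timeout.
 */
time_t
alive_timeout_secs(time_t server_alive_time, time_t now)
{
	return MAXIMUM(server_alive_time - now, 0);
}
```

This behaves the same as the three quoted lines: a deadline in the future gives the remaining seconds, and a deadline in the past gives 0.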
Created attachment 3419
Make ServerAlive behave correctly during client port forward activity
Created attachment 3420
Move the ServerAlive scheduling into a helper function.

To me this is a bit easier to read.
Created attachment 3421
Move the ServerAlive scheduling into a helper function.

fix typo
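As a rough illustration of what "moving the scheduling into a helper function" can look like, here is a sketch. It is not the attached patch: struct alive_sched and schedule_alive_check() are hypothetical names, and the current time is passed in explicitly so the example stays self-contained.

```c
#include <time.h>

/* Hypothetical stand-ins for the client's option and timer state. */
struct alive_sched {
	int interval;		/* in the role of ServerAliveInterval; 0 disables */
	time_t next_check;	/* 0 when no check is scheduled */
};

/*
 * Single place that decides when the next ServerAlive check is due.
 * Callers invoke it whenever the peer has shown signs of life, instead
 * of repeating the interval arithmetic at each call site.
 */
void
schedule_alive_check(struct alive_sched *s, time_t now)
{
	if (s->interval > 0)
		s->next_check = now + s->interval;
	else
		s->next_check = 0;
}
```

Keeping the arithmetic in one function means the "is the feature enabled" test and the deadline computation cannot drift apart between call sites.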
(modified) patch applied and will be in the 8.4 release. Thanks for the report and patch.
close bugs that were resolved in OpenSSH 8.5 release cycle