It seems that recent ssh's SOCKS proxy leaks file descriptors / sockets stuck in FIN_WAIT2 state. I first noticed this with a 9.0p1 client talking to an 8.9p1 server, reproduced it after upgrading the server to 9.0p1, and am currently testing with the client downgraded to 8.9p1, but it looks the same.

I have an ssh -D listener that is used both by browsers to reach internal webservers and by ssh (via ProxyCommand=nc -X ...). After some time (days?) new SOCKS-proxied TCP connections start to fail - the local listener still responds to a SYN, but never passes anything through. Existing proxied connections still pass traffic. If the proxying SSH client was not backgrounded, it starts spitting out:

    accept: Too many open files
    accept: Too many open files
    accept: Too many open files
    accept: Too many open files

The client sits at 100% CPU, and when strace'ing it I see a busy-loop of poll and getpid. This makes me suspect the changes from select to poll in ~8.9 (https://marc.info/?l=openssh-unix-dev&m=164151015729522&w=4).

The client has accumulated a bunch of file descriptors:

    # ls -l /proc/5338/fd/ | wc -l
    1025

And a bunch of sockets in FIN_WAIT2:

    # netstat -antp | awk '/5338/{print $6}' | sort | uniq -c
          4 ESTABLISHED
       1015 FIN_WAIT2
          2 LISTEN

Meanwhile, on the server:

    # netstat -antp | awk '/27472/{print $6}' | sort | uniq -c
       1015 CLOSE_WAIT
          3 ESTABLISHED
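In case it helps, here is a quick way to watch the leak build up over time - just a sketch reusing the commands above; the PID (5338 here) and the 60-second interval are placeholders for whatever the ssh -D process in question is:

    #!/bin/sh
    # Periodically log the client's open-fd count and TCP socket states.
    PID=5338
    while sleep 60; do
        date
        ls /proc/$PID/fd | wc -l
        netstat -antp 2>/dev/null | awk -v pid="$PID" '$0 ~ pid {print $6}' | sort | uniq -c
    done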
Confirmed 8.9p1 client -> 9.0p1 server has the same problem.
I'm not able to replicate this using OpenSSH HEAD (basically 9.0) on the client and 8.8p1 on the server. Using:

    $ ulimit -n 32 ; ./ssh -vvvFnone -D1080 ::1

to start the SOCKS listener, and:

    $ for x in `seq 1 256` ; do nc -X 5 -x 127.0.0.1:1080 127.0.0.1 22 </dev/null | head -n 1 ; done

to make sure it is disposing of the fds correctly. It seems to be - there is no accumulation of connections in FIN_WAIT2 from this.

Next, I tested creating more connections than open fds using:

    $ sh -c 'ulimit -n 16 ; ./ssh -vvvFnone -D1080 ::1'

and running:

    $ nc -X 5 -x 127.0.0.1:1080 127.0.0.1 22 &

until no more connections were accepted. At this point, ssh implements an accept hold-off that prevents hard spinning on accept():

    accept: Too many open files
    debug3: channel_handler: chan 1: skip for 1 more seconds
    debug3: channel_handler: first channel unpauses in 1 seconds
    debug1: Connection to port 1080 forwarding to socks port 0 requested.

So it does call accept() at 1 Hz, but it shouldn't (and doesn't in my tests) use 100% CPU. To investigate this further, we will need a debug log (ssh -vvv) from the misbehaving client.
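If anyone else wants to try reproducing this, the two steps above can be rolled into one script - just a sketch, not a definitive test: the connection count, hold time, and addresses are arbitrary, and it assumes the low-ulimit listener from above is already running in another terminal:

    #!/bin/sh
    # Assumes, in another terminal:
    #   sh -c 'ulimit -n 16 ; ./ssh -vvvFnone -D1080 ::1'

    # Hold open more proxied connections than the fd limit allows.
    pids=""
    for x in `seq 1 32` ; do
        # Each 'sleep | nc' pipeline keeps one SOCKS-proxied connection open.
        sleep 120 | nc -X 5 -x 127.0.0.1:1080 127.0.0.1 22 &
        pids="$pids $!"
    done

    sleep 5
    echo "=== while saturated ==="
    netstat -antp 2>/dev/null | awk '/:1080/{print $6}' | sort | uniq -c

    kill $pids 2>/dev/null
    sleep 5
    echo "=== after closing the clients ==="
    # A healthy ssh -D should accept new connections again and show no
    # pile-up of FIN_WAIT2 sockets here.
    netstat -antp 2>/dev/null | awk '/:1080/{print $6}' | sort | uniq -c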
Thanks for looking into it! I have noticed this mostly occurs for me when I have a browser tab open to a particularly chatty internal webapp, which I've blissfully not had open for a while now. Next time I'm forced to use it I'll see if I can gather more useful troubleshooting info.