Bug 3445 - ssh -D leaks file descriptors until new connections fail
Summary: ssh -D leaks file descriptors until new connections fail
Status: NEW
Alias: None
Product: Portable OpenSSH
Classification: Unclassified
Component: ssh
Version: 9.0p1
Hardware: amd64 Linux
Importance: P5 normal
Assignee: Assigned to nobody
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-06-12 07:53 AEST by hlein
Modified: 2022-07-11 03:30 AEST
CC List: 1 user

See Also:


Description hlein 2022-06-12 07:53:27 AEST
It seems that recent ssh's SOCKS proxy (-D) leaks file descriptors: sockets get stuck in the FIN_WAIT2 state. I first noticed this with a 9.0p1 client talking to an 8.9p1 server, and reproduced it after upgrading the server to 9.0p1; I am currently testing with the client downgraded to 8.9p1, but it looks the same.

I have an ssh -D listener that is used both by browsers to reach internal webservers, and by ssh (via ProxyCommand=nc -X ...).
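
For concreteness, the second consumer looks something like this (the proxy address and target hostname here are placeholders for my real setup):

$ ssh -o 'ProxyCommand=nc -X 5 -x 127.0.0.1:1080 %h %p' internal-host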

After some time (days?), new SOCKS-proxied TCP connections start to fail: the local listener still responds to a SYN, but never passes anything through.

Existing proxied connections will still pass traffic.
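
A quick way to check whether the listener is wedged (assuming it listens on 127.0.0.1:1080; the target host is a placeholder) is:

$ nc -X 5 -x 127.0.0.1:1080 internal-host 80 </dev/null

When the proxy is healthy the connection goes through promptly; when it is wedged, the TCP connect succeeds but nothing ever comes back.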

If the proxying ssh client is running in the foreground, it starts spitting out:
accept: Too many open files
accept: Too many open files
accept: Too many open files
accept: Too many open files

The client sits at 100% CPU, and when stracing it I see a busy loop of poll() and getpid() calls. That makes me suspect the select -> poll conversion in ~8.9 (https://marc.info/?l=openssh-unix-dev&m=164151015729522&w=4).
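
For the record, attaching to the stuck client (PID 5338 below) with something like this should capture that loop; the syscall filter is just my guess at the relevant calls:

$ strace -f -e trace=poll,getpid,accept,accept4 -p 5338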

The client has accumulated a bunch of file descriptors:
# ls -l /proc/5338/fd/ | wc -l
1025

And a bunch of sockets in FIN_WAIT2:
# netstat -antp | awk '/5338/{print $6}' | sort | uniq -c
      4 ESTABLISHED
   1015 FIN_WAIT2
      2 LISTEN

Meanwhile on the server:
# netstat -antp | awk '/27472/{print $6}' | sort | uniq -c
   1015 CLOSE_WAIT
      3 ESTABLISHED
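
If it helps, the leak's growth is easy to track over time with something like this (same client PID as above):

# watch -n 60 'ls /proc/5338/fd | wc -l'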
Comment 1 hlein 2022-06-13 04:26:21 AEST
Confirmed 8.9p1 client -> 9.0p1 server has the same problem.
Comment 2 Damien Miller 2022-06-24 13:30:54 AEST
I'm not able to replicate this using openssh HEAD (basically 9.0) on the client and 8.8p1 on the server.

Using:

$ ulimit -n 32 ; ./ssh -vvvFnone -D1080 ::1

to start the SOCKS listener and

$ for x in `seq 1 256` ; do nc -X 5 -x 127.0.0.1:1080 127.0.0.1 22 </dev/null | head -n 1 ; done

to make sure it is disposing of the fds correctly. It seems to be - there is no accumulation of connections in FIN_WAIT2 from this.

Next, I tested creating more connections than open fds using

$ sh -c 'ulimit -n 16 ; ./ssh -vvvFnone -D1080 ::1'

and running

$ nc -X 5 -x 127.0.0.1:1080 127.0.0.1 22 &

until no more connections were accepted.

At this point, ssh implements an accept hold-off that prevents hard spinning on accept():

accept: Too many open files
debug3: channel_handler: chan 1: skip for 1 more seconds
debug3: channel_handler: first channel unpauses in 1 seconds
debug1: Connection to port 1080 forwarding to socks port 0 requested.

So it does call accept() at 1 Hz, but it shouldn't (and doesn't in my tests) use 100% CPU.

To investigate this further, we will need a debug log (ssh -vvv) from the misbehaving client.
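
One way to capture that would be to leave something like this running until the failure recurs (hostname and log path are placeholders):

$ ssh -vvv -N -D 1080 host 2> ssh-socks-debug.log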
Comment 3 hlein 2022-07-11 03:30:22 AEST
Thanks for looking into it!

I have noticed this mostly occurs for me when I have a browser tab open to a particularly chatty internal webapp, which I've blissfully been able to leave closed for a while now. Next time I'm forced to use it, I'll see if I can gather more useful troubleshooting info.