Created attachment 3499 [details] OpenSSH client strace output Hello, We encountered an issue with the ssh client (even version 8.5p1) where it tries to select() a closed file descriptor resulting in a failure and the control master socket is closed. The issue occurs when we connect to multiple target hosts (~ 100 hosts) through an SSH bastion server (using ProxyJump) and issue a command to each target host (Eg. 'id'). We consistently encounter the following error with one of the *read* file descriptors on a MUX channel: select: Bad file descriptor Tested the following versions on Debian 10 (identical results): OpenSSH 7.9p1 (latest Debian 10 package) OpenSHS 8.5p1 (github manual build) Client configuration: # Bastion: Persistent Socket and SOCKS Proxy Host my-bastion User myuser ProxyJump none ControlMaster auto ControlPersist 28800s ControlPath ~/.ssh/my-bastion.sock DynamicForward 127.0.0.1:1080 ExitOnForwardFailure yes HostName my-bastion1.mydomain.com # Jump via Bastion for those hosts Host *.mydomain.com ProxyJump my-bastion # Catch all Host * User root SendEnv LANG LC_* AddKeysToAgent yes ForwardAgent yes TCPKeepAlive yes ServerAliveCountMax 3 ServerAliveInterval 20 AddressFamily inet Build: (See openssh-build.txt attachment) Steps to reproduce: # Create a connection to the bastion (debug level 3 logging), exit (socket is still present on client), strace the ssh pid attached to the bastion socket on client host: ssh -vvv -E ssh.log my-bastion exit # myuser 14510 0.5 0.0 16256 2660 ? Ss 00:25 0:00 ssh: /home/myuser/.ssh/my-bastion.sock [mux] strace -f -s 2048 -o strace.txt -p 14510 # separate terminal ANSIBLE_SSH_ARGS= ansible -i my_target_hosts all -a id When the client attempts to select the closed file descriptor for a MUX channel, the end result is the control master socket is closed and unlinked. I will attach files for: * source locations of both the close() and select() * ssh logs * strace output Let me know if you need any additional info. Much appreciated,
Created attachment 3500 [details] OpenSSH client log
Created attachment 3501 [details] OpenSSH source files
Created attachment 3502 [details] OpenSSH build steps
There isn't quite enough debug output there to figure out what is going wrong and I'm not able to replicate this locally (w/ 40 concurrent jobs each making 100 multiplexed connections). Could you attach a complete client debug output (ssh -vvv ...) for both the main multiplex process and the failing passenger process? Likewise, more complete strace output would be helpful. Please use OpenSSH if possible as I just added a bit more debugging (commit f068930635) that might help figure out what is going wrong.
Thanks Damien, I will re-run the test with another build (using commit f068930635) and will try to collect & provide additional logging.
Created attachment 3515 [details] Full OpenSSH_8.6p1 MUX proc log
Created attachment 3516 [details] Full OpenSSH_8.6p1 ansible proc log
Apologies for the latent response, I am able to reproduce this issue on every attempt with OpenSSH 8.6p1 (commit f068930635). I have attached the full ssh log output for both the MUX process and the ansible / ssh processes running through the MUX connection to the bastion host. Full OpenSSH_8.6p1 MUX proc log Full OpenSSH_8.6p1 ansible proc log (gzip) I used the same steps outlined in the original comment with the exception where extra logging was enabled on the ansible side: ANSIBLE_SSH_ARGS="-vvv -E ./ssh.log" ansible -i my_target_hosts all -a id Let me know whether you would also need the full strace output or whether the logs above will suffice. Thanks
Created attachment 3517 [details] Full OpenSSH_8.6p1 MUX proc log
Created attachment 3518 [details] debug select failures Unfortunately, it's hard to figure out what is going on there without the actual bad file descriptor. Sorry to be a bother, but are you able to reproduce using git HEAD with this patch applied? It includes some extra debugging that will let us determine the sequence of events, and will log which file descriptors are bad after select fails.
Closing for lack of followup. OpenSSH HEAD has replaced the use of select() with poll(). Please try HEAD or OpenSSH 8.9 when it is released as it might fix the problem you're seeing. If not, then I recommend setting the DEBUG_CHANNEL_POLL #define at the start of channels.c and attaching the debug output. poll(2) is easier to debug than select(2), because it will tell you which fd is bad via POLLNVAL and we do log this information
closing bugs resolved before openssh-8.9