Bug 3304 - SSH client MUX to multiple hosts causes select: Bad file descriptor
Summary: SSH client MUX to multiple hosts causes select: Bad file descriptor
Status: CLOSED WORKSFORME
Alias: None
Product: Portable OpenSSH
Classification: Unclassified
Component: ssh (show other bugs)
Version: 8.5p1
Hardware: amd64 Linux
: P5 normal
Assignee: Assigned to nobody
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-04-24 11:20 AEST by E B
Modified: 2022-02-25 13:58 AEDT (History)
2 users (show)

See Also:


Attachments
OpenSSH client strace output (2.13 KB, text/plain)
2021-04-24 11:20 AEST, E B
no flags Details
OpenSSH client log (1.46 KB, text/plain)
2021-04-24 11:22 AEST, E B
no flags Details
OpenSSH source files (1.17 KB, text/plain)
2021-04-24 11:23 AEST, E B
no flags Details
OpenSSH build steps (718 bytes, text/plain)
2021-04-24 11:24 AEST, E B
no flags Details
Full OpenSSH_8.6p1 MUX proc log (515.18 KB, application/octet-stream)
2021-05-12 05:23 AEST, E B
no flags Details
Full OpenSSH_8.6p1 ansible proc log (42.41 KB, application/x-gzip)
2021-05-12 05:26 AEST, E B
no flags Details
Full OpenSSH_8.6p1 MUX proc log (515.18 KB, text/plain)
2021-05-12 05:50 AEST, E B
no flags Details
debug select failures (2.96 KB, patch)
2021-05-13 07:45 AEST, Damien Miller
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description E B 2021-04-24 11:20:27 AEST
Created attachment 3499 [details]
OpenSSH client strace output

Hello,

We encountered an issue with the ssh client (even version 8.5p1) where it tries to select() a closed file descriptor resulting in a failure and the control master socket is closed.  The issue occurs when we connect to multiple target hosts (~ 100 hosts) through an SSH bastion server (using ProxyJump) and issue a command to each target host (Eg. 'id'). We consistently encounter the following error with one of the *read* file descriptors on a MUX channel:

select: Bad file descriptor

Tested the following versions on Debian 10 (identical results):

OpenSSH 7.9p1 (latest Debian 10 package)
OpenSHS 8.5p1 (github manual build)

Client configuration:

# Bastion: Persistent Socket and SOCKS Proxy
Host my-bastion
    User myuser
    ProxyJump none
    ControlMaster auto
    ControlPersist 28800s
    ControlPath ~/.ssh/my-bastion.sock
    DynamicForward 127.0.0.1:1080
    ExitOnForwardFailure yes
    HostName my-bastion1.mydomain.com

# Jump via Bastion for those hosts
Host *.mydomain.com
    ProxyJump my-bastion

# Catch all
Host *
    User root
    SendEnv LANG LC_*
    AddKeysToAgent yes
    ForwardAgent yes
    TCPKeepAlive yes
    ServerAliveCountMax 3
    ServerAliveInterval 20
    AddressFamily inet


Build:

(See openssh-build.txt attachment)

Steps to reproduce:

# Create a connection to the bastion (debug level 3 logging), exit (socket is still present on client), strace the ssh pid attached to the bastion socket on client host:
ssh -vvv -E ssh.log my-bastion
exit
# myuser 14510  0.5  0.0  16256  2660 ?        Ss   00:25   0:00 ssh: /home/myuser/.ssh/my-bastion.sock [mux]
strace -f -s 2048 -o strace.txt -p 14510

# separate terminal
ANSIBLE_SSH_ARGS= ansible -i my_target_hosts all -a id


When the client attempts to select the closed file descriptor for a MUX channel, the end result is the control master socket is closed and unlinked. I will attach files for:

* source locations of both the close() and select()
* ssh logs
* strace output

Let me know if you need any additional info.
Much appreciated,
Comment 1 E B 2021-04-24 11:22:29 AEST
Created attachment 3500 [details]
OpenSSH client log
Comment 2 E B 2021-04-24 11:23:13 AEST
Created attachment 3501 [details]
OpenSSH source files
Comment 3 E B 2021-04-24 11:24:06 AEST
Created attachment 3502 [details]
OpenSSH build steps
Comment 4 Damien Miller 2021-04-30 15:14:28 AEST
There isn't quite enough debug output there to figure out what is going wrong and I'm not able to replicate this locally (w/ 40 concurrent jobs each making 100 multiplexed connections).

Could you attach a complete client debug output (ssh -vvv ...) for both the main multiplex process and the failing passenger process? Likewise, more complete strace output would be helpful.

Please use OpenSSH if possible as I just added a bit more debugging (commit f068930635) that might help figure out what is going wrong.
Comment 5 E B 2021-05-04 11:21:18 AEST
Thanks Damien, I will re-run the test with another build (using commit f068930635) and will try to collect & provide additional logging.
Comment 6 E B 2021-05-12 05:23:43 AEST
Created attachment 3515 [details]
Full OpenSSH_8.6p1 MUX proc log
Comment 7 E B 2021-05-12 05:26:01 AEST
Created attachment 3516 [details]
Full OpenSSH_8.6p1 ansible proc log
Comment 8 E B 2021-05-12 05:38:00 AEST
Apologies for the latent response, I am able to reproduce this issue on every attempt with OpenSSH 8.6p1 (commit f068930635).

I have attached the full ssh log output for both the MUX process and the ansible / ssh processes running through the MUX connection to the bastion host.

Full OpenSSH_8.6p1 MUX proc log
Full OpenSSH_8.6p1 ansible proc log (gzip)

I used the same steps outlined in the original comment with the exception where extra logging was enabled on the ansible side:

ANSIBLE_SSH_ARGS="-vvv -E ./ssh.log" ansible -i my_target_hosts all -a id

Let me know whether you would also need the full strace output or whether the logs above will suffice.

Thanks
Comment 9 E B 2021-05-12 05:50:41 AEST
Created attachment 3517 [details]
Full OpenSSH_8.6p1 MUX proc log
Comment 10 Damien Miller 2021-05-13 07:45:54 AEST
Created attachment 3518 [details]
debug select failures

Unfortunately, it's hard to figure out what is going on there without the actual bad file descriptor. Sorry to be a bother, but are you able to reproduce using git HEAD with this patch applied?

It includes some extra debugging that will let us determine the sequence of events, and will log which file descriptors are bad after select fails.
Comment 11 Damien Miller 2022-01-14 15:16:11 AEDT
Closing for lack of followup.

OpenSSH HEAD has replaced the use of select() with poll(). Please try HEAD or OpenSSH 8.9 when it is released as it might fix the problem you're seeing.

If not, then I recommend setting the DEBUG_CHANNEL_POLL #define at the start of channels.c and attaching the debug output. poll(2) is easier to debug than select(2), because it will tell you which fd is bad via POLLNVAL and we do log this information
Comment 12 Damien Miller 2022-02-25 13:58:31 AEDT
closing bugs resolved before openssh-8.9