Bug 3405 - clientloop's client_wait_until_can_do_something uses 100 % CPU with ssh 2> >({exec 1>&2})
Summary: clientloop's client_wait_until_can_do_something uses 100 % CPU with ssh 2> >(...
Status: CLOSED FIXED
Alias: None
Product: Portable OpenSSH
Classification: Unclassified
Component: ssh (show other bugs)
Version: 8.9p1
Hardware: amd64 Linux
: P5 minor
Assignee: Damien Miller
URL:
Keywords:
: 3411 (view as bug list)
Depends on:
Blocks: V_9_0
  Show dependency treegraph
 
Reported: 2022-03-14 22:18 AEDT by Nathan Monfils
Modified: 2022-04-08 12:12 AEST (History)
2 users (show)

See Also:


Attachments
avoid polling fds we're not ready to service events on (9.79 KB, patch)
2022-03-15 17:31 AEDT, Damien Miller
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Nathan Monfils 2022-03-14 22:18:10 AEDT
Hi,

This is a bit of an edge-case (and I've already found a workaround using `exec tee`), but since updating to the latest release I've had 100 % CPU usage on a script, which you can replicate with `ssh <host> 2> >({exec 1>&2})`. (I'm actually redirecting stderr to a shell function that parses it for a while then gives up on it by doing `exec 1>&2`).

After a few seconds, the ssh process starts using 100 % CPU. Using GDB, I consistently get the following trace:

<SNIP>
#1  0x0000556af8a302e1 in poll (__timeout=<optimized out>, __nfds=<optimized out>, __fds=<optimized out>)
    at /usr/include/bits/poll2.h:39
#2  client_wait_until_can_do_something (conn_out_readyp=<synthetic pointer>, conn_in_readyp=<synthetic pointer>, 
    rekeying=<optimized out>, npfd_activep=0x7ffe8c32d07c, npfd_allocp=0x7ffe8c32d078, pfdp=0x7ffe8c32d080, ssh=0x556af964bf80)
    at clientloop.c:575
<SNIP>

I'm guessing this is linked to OpenBSD-Commit a77e16a667d5b194dcdb3b76308b8bba7fa7239c "upstream: convert ssh, sshd mainloops from select() to poll();".
Comment 1 Damien Miller 2022-03-15 10:00:43 AEDT
I can't get this to replicate. What's the full command you're using? `ssh <host> 2> >({exec 1>&2})` doesn't include a command that makes any output
Comment 2 Damien Miller 2022-03-15 17:31:00 AEDT
Created attachment 3581 [details]
avoid polling fds we're not ready to service events on

I did get it to reproduce, I was just holding it wrong.

Thanks for the report and the reproduction instructions. Here's my analysis and a fix is attached:

> I think the sequence of operations is something like:
> 
> 1. Channel opens, stderr is attached
> 2. Stderr goes away
> 3. We poll with pfd[stderr].events including POLLOUT
> 4. We get back POLLHUP (POLLERR on Linux)
> 5. channel_handle_efd_write() sees sshbuf_len(c->extended)==0, returns
> 6. GOTO 3 forever
> 
> One problem is that there is no way to propagate a POLLHUP condition
> back to a writable channel when there is no output pending.
> 
> A more fundamental problem is step 3, where we unconditionally included
> the fd in the poll array, regardless of whether the channel code had any
> intention of attempting a write() later. When I was doing the conversion
> from select(), I kept the fds in there because it made the matching of
> pfd entries and channels easier but this was probably a mistake.
> 
> The patch corrects the mistake (hopefully). It will only
> set up a pollfd entry if IO was requested for the fd, and as a
> consequence, should avoid spurious events. It also forced me to make
> the pollfd bookkeeping less brittle :)
> 
> PS. an alternate approach would be to leave the pollfd entries in the
> array and find a way to handle POLLERR/POLLHUP events. I have a diff
> that implements this approach too, by treating such a condition on
> a channel that wasn't ready to write something as an automatic write
> failure. I shelved this in favour of this approach.
Comment 3 Damien Miller 2022-03-31 08:26:54 AEDT
*** Bug 3411 has been marked as a duplicate of this bug. ***
Comment 4 Damien Miller 2022-03-31 08:27:53 AEDT
This has been fixed in git head and will be in the openssh-9.0 release that is due very soon.

commit d6556de1db0822c76ba2745cf5c097d9472adf7c
Author: djm@openbsd.org <djm@openbsd.org>
Date:   Wed Mar 30 21:10:25 2022 +0000

    upstream: fix poll() spin when a channel's output fd closes without
    
    data in the channel buffer. Introduce more exact packing of channel fds into
    the pollfd array. fixes bz3405 and bz3411; ok deraadt@ markus@
    
    OpenBSD-Commit-ID: 06740737849c9047785622ad5d472cb6a3907d10
Comment 5 Damien Miller 2022-04-08 12:12:52 AEST
closing bug resolved during openssh-9.0 release cycle