Bug 2071

Summary: sshd closes stderr but not stdout when child process exits
Product: Portable OpenSSH Reporter: JD Paul <jdpaul>
Component: sshdAssignee: Damien Miller <djm>
Status: CLOSED FIXED    
Severity: normal CC: ahmedsayeed1982, djm, dtucker, jjelen
Priority: P5    
Version: 6.1p1   
Hardware: All   
OS: All   
Bug Depends on:    
Bug Blocks: 2915    
Attachments:
Description Flags
strace output and code fragments demonstrating stderr-vs-stdout bug
none
Copy close-deferral logic for efd from rfd dtucker: ok+

Description JD Paul 2013-02-16 02:50:32 AEDT
Created attachment 2220 [details]
strace output and code fragments demonstrating stderr-vs-stdout bug

Description
-----------

When using ssh to run a remote command which forks a child process and
then exits, the behavior of sshd differs for stdout and stderr: stdout
stays connected, but stderr gets closed.

This behavior occurs for Protocol 2 on OpenSSH 4.6p1 and later, on any OS.

For Protocol 2 and OpenSSH 4.5p1 and earlier, stdout and stderr both
stay connected.

For Protocol 1 on any version, stdout and stderr both stay connected.


The effect of the stderr close is that any process on the server side
that tries to write to stderr, once sshd closes its end of the stderr
pipe, will get a SIGPIPE signal.  This will in most cases kill the
process tree (either directly or via cascading errors).

This is a subtle problem (and can be rare and hard to reproduce, since
it depends on something writing to stderr), but it can be easily
reproduced.


Demonstration
-------------

NOTE: these examples work only with a remote user with a Bourne shell
(sh, ksh, bash, or similar).  A shell with a built-in 'echo' (bash or
ksh) will give the cleanest demonstration.


1.  stdout only:  stdout stays connected after primary command exit:

CMD="echo parent ; ( echo child stdout 1 ; sleep 2 ; echo child stdout 2; sleep 2 ; echo child stdout 3 ) & echo parent end ; exit 0"

ssh -2 $testhost "$CMD" 2>&1 |
    while read line ; do date "+%Y%m%d %T.%3N $line" ; done

This emits output like this:

20130206 15:35:33.743 parent
20130206 15:35:33.751 parent end
20130206 15:35:33.752 child stdout 1
20130206 15:35:35.742 child stdout 2
20130206 15:35:37.768 child stdout 3


2.  Add writes to stderr:  this will cause the connection to break.
The time of the break is dependent on which completes first, the
parent shell exit or the child shell's write to stderr.  This is a
classic race condition, and is dependent on the CPU speed of the
remote host.

CMD="echo parent ; ( echo child stdout 1 ; echo child stderr 1 1>&2 ; sleep 2 ; echo child stdout 2 ; echo child stderr 2 1>&2 ; sleep 2 ; echo child stdout 3; echo child stderr 3 1>&2 ) & echo parent end ; exit 0"

ssh -2 $testhost "$CMD" 2>&1 |
    while read line ; do date "+%Y%m%d %T.%3N $line" ; done

For affected versions of OpenSSH, this emits output like this:

20130206 15:37:43.406 parent
20130206 15:37:43.408 child stdout 1
20130206 15:37:43.410 child stderr 1
20130206 15:37:43.412 parent end
20130206 15:37:45.406 child stdout 2

Slower hosts might break the connection on the first write to stderr
('echo child stderr 1 1>&2'), while faster hosts usually break the
connection on the second ('echo child stderr 2 1>&2').

For unaffected versions of OpenSSH, it emits the full output:

20130206 16:10:44.418 parent
20130206 16:10:44.420 child stdout 1
20130206 16:10:44.551 parent end
20130206 16:10:44.553 child stderr 1
20130206 16:10:46.422 child stdout 2
20130206 16:10:46.424 child stderr 2
20130206 16:10:48.425 child stdout 3
20130206 16:10:48.427 child stderr 3


Software Versions
-----------------

I tested a range of OpenSSH versions, each compiled from source, to
figure out when the bug was introduced.

The bug is not present in OpenSSH-4.5p1 and earlier.

The bug is present in OpenSSH-4.6p1 and later, all the way through
OpenSSH-6.1p1.

(Confusingly, the bug IS present on RedHat Enterprise/CentOS 5.4,
which include OpenSSH-4.3p2 via the openssh-4.3p2-36.el5_4.2 RPM;
RedHat probably back-ported changes without changing the version
number.)


Analysis
--------

strace output and code fragments are included in the attachment.


Thanks -- 

JD Paul
jdpaul@interstel.net
Comment 1 Darren Tucker 2013-05-10 14:08:49 AEST
thanks, we'll take a look at this for 6.3.
Comment 2 Damien Miller 2013-07-25 12:17:58 AEST
Retarget to openssh-6.4
Comment 3 Damien Miller 2013-07-25 12:20:54 AEST
Retarget 6.3 -> 6.4
Comment 4 Damien Miller 2014-02-06 10:18:07 AEDT
Retarget incomplete bugs / feature requests to 6.6 release
Comment 5 Damien Miller 2014-02-06 10:20:13 AEDT
Retarget incomplete bugs / feature requests to 6.6 release
Comment 6 Damien Miller 2014-04-12 14:48:58 AEST
Retarget to 6.7 release, since 6.6 was mostly bugfixing.
Comment 7 Damien Miller 2014-04-12 14:54:20 AEST
Remove from 6.6 tracking bug
Comment 8 Damien Miller 2014-08-30 04:38:56 AEST
Retarget incomplete bugs to 6.8 release.
Comment 9 Damien Miller 2014-08-30 04:39:51 AEST
These bugs are no longer targeted at the imminent 6.7 release
Comment 10 Damien Miller 2015-03-03 07:59:32 AEDT
OpenSSH 6.8 is approaching release and closed for major work. Retarget these bugs for the next release.
Comment 11 Damien Miller 2015-03-03 08:01:12 AEDT
Retarget to 6.9
Comment 12 Damien Miller 2015-06-05 14:17:32 AEST
AFAIK, some of this behaviour was the workaround on bug 52.
Comment 13 Jakub Jelen 2016-05-12 01:02:52 AEST
Yes, the issue is certainly related to the bug #52, but I don't think it is caused by the workaround for this bug. The patch for the above mentioned bug modifies handling of output and error streams, but each of them different way. The stdout is blocked without TTY but the stderr is not.

I spent some time with this case and finally ended up with the patch below. 
I was unable to reproduce the problems from the previous bug nor from this one. Yes, it is basically the same hack for the seconds stream, but probably better than noting and running into another weird and unexpected SIGPIPE exits.

Also running regress tests passed for me just fine with 6.6p1 version, but this part didn't see many changes since then so git head should be good too (still running some tests).

diff --git a/channels.c b/channels.c
index 7ee1f98..2601ad5 100644
--- a/channels.c
+++ b/channels.c
@@ -1835,7 +1835,7 @@ channel_handle_efd(Channel *c, fd_set *readset, fd_set *writeset)
 			debug2("channel %d: read %d from efd %d",
 			    c->self, len, c->efd);
 			if (len < 0 && (errno == EINTR || ((errno == EAGAIN ||
-			    errno == EWOULDBLOCK) && !c->detach_close)))
+			    errno == EWOULDBLOCK) && !(c->isatty && c->detach_close))))
 				return 1;
 			if (len <= 0) {
 				debug2("channel %d: closing read-efd %d",
Comment 14 Jakub Jelen 2016-08-03 16:07:40 AEST
Damien, any chance to get this fixed upstream? We got confirmation that the patch from last comment works and solves the problem of closing stderr.
Comment 15 Damien Miller 2019-02-08 14:15:27 AEDT
Created attachment 3240 [details]
Copy close-deferral logic for efd from rfd

IMO this is the best way to fix it - this copies the conditions that we use for stdout for stderr. Relative to Jakub's patch, it additionally tests the channel state.

Seems to pass the test that JD Paul kindly shared.
Comment 16 Damien Miller 2019-02-08 14:52:58 AEDT
This has been applied - thanks again for the detailed analysis and sorry that this slipped my radar for so long.

This should be in the OpenSSH 8.0 release.
Comment 17 Damien Miller 2021-03-04 09:54:03 AEDT
close bugs that were resolved in OpenSSH 8.5 release cycle
Comment 18 Ahmed Sayeed 2021-10-14 01:40:27 AEDT
[spam removed]