Created attachment 2220 [details] strace output and code fragments demonstrating stderr-vs-stdout bug Description ----------- When using ssh to run a remote command which forks a child process and then exits, the behavior of sshd differs for stdout and stderr: stdout stays connected, but stderr gets closed. This behavior occurs for Protocol 2 on OpenSSH 4.6p1 and later, on any OS. For Protocol 2 and OpenSSH 4.5p1 and earlier, stdout and stderr both stay connected. For Protocol 1 on any version, stdout and stderr both stay connected. The effect of the stderr close is that any process on the server side that tries to write to stderr, once sshd closes its end of the stderr pipe, will get a SIGPIPE signal. This will in most cases kill the process tree (either directly or via cascading errors). This is a subtle problem (and can be rare and hard to reproduce, since it depends on something writing to stderr), but it can be easily reproduced. Demonstration ------------- NOTE: these examples work only with a remote user with a Bourne shell (sh, ksh, bash, or similar). A shell with a built-in 'echo' (bash or ksh) will give the cleanest demonstration. 1. stdout only: stdout stays connected after primary command exit: CMD="echo parent ; ( echo child stdout 1 ; sleep 2 ; echo child stdout 2; sleep 2 ; echo child stdout 3 ) & echo parent end ; exit 0" ssh -2 $testhost "$CMD" 2>&1 | while read line ; do date "+%Y%m%d %T.%3N $line" ; done This emits output like this: 20130206 15:35:33.743 parent 20130206 15:35:33.751 parent end 20130206 15:35:33.752 child stdout 1 20130206 15:35:35.742 child stdout 2 20130206 15:35:37.768 child stdout 3 2. Add writes to stderr: this will cause the connection to break. The time of the break is dependent on which completes first, the parent shell exit or the child shell's write to stderr. This is a classic race condition, and is dependent on the CPU speed of the remote host. CMD="echo parent ; ( echo child stdout 1 ; echo child stderr 1 1>&2 ; sleep 2 ; echo child stdout 2 ; echo child stderr 2 1>&2 ; sleep 2 ; echo child stdout 3; echo child stderr 3 1>&2 ) & echo parent end ; exit 0" ssh -2 $testhost "$CMD" 2>&1 | while read line ; do date "+%Y%m%d %T.%3N $line" ; done For affected versions of OpenSSH, this emits output like this: 20130206 15:37:43.406 parent 20130206 15:37:43.408 child stdout 1 20130206 15:37:43.410 child stderr 1 20130206 15:37:43.412 parent end 20130206 15:37:45.406 child stdout 2 Slower hosts might break the connection on the first write to stderr ('echo child stderr 1 1>&2'), while faster hosts usually break the connection on the second ('echo child stderr 2 1>&2'). For unaffected versions of OpenSSH, it emits the full output: 20130206 16:10:44.418 parent 20130206 16:10:44.420 child stdout 1 20130206 16:10:44.551 parent end 20130206 16:10:44.553 child stderr 1 20130206 16:10:46.422 child stdout 2 20130206 16:10:46.424 child stderr 2 20130206 16:10:48.425 child stdout 3 20130206 16:10:48.427 child stderr 3 Software Versions ----------------- I tested a range of OpenSSH versions, each compiled from source, to figure out when the bug was introduced. The bug is not present in OpenSSH-4.5p1 and earlier. The bug is present in OpenSSH-4.6p1 and later, all the way through OpenSSH-6.1p1. (Confusingly, the bug IS present on RedHat Enterprise/CentOS 5.4, which include OpenSSH-4.3p2 via the openssh-4.3p2-36.el5_4.2 RPM; RedHat probably back-ported changes without changing the version number.) Analysis -------- strace output and code fragments are included in the attachment. Thanks -- JD Paul jdpaul@interstel.net
thanks, we'll take a look at this for 6.3.
Retarget to openssh-6.4
Retarget 6.3 -> 6.4
Retarget incomplete bugs / feature requests to 6.6 release
Retarget to 6.7 release, since 6.6 was mostly bugfixing.
Remove from 6.6 tracking bug
Retarget incomplete bugs to 6.8 release.
These bugs are no longer targeted at the imminent 6.7 release
OpenSSH 6.8 is approaching release and closed for major work. Retarget these bugs for the next release.
Retarget to 6.9
AFAIK, some of this behaviour was the workaround on bug 52.
Yes, the issue is certainly related to the bug #52, but I don't think it is caused by the workaround for this bug. The patch for the above mentioned bug modifies handling of output and error streams, but each of them different way. The stdout is blocked without TTY but the stderr is not. I spent some time with this case and finally ended up with the patch below. I was unable to reproduce the problems from the previous bug nor from this one. Yes, it is basically the same hack for the seconds stream, but probably better than noting and running into another weird and unexpected SIGPIPE exits. Also running regress tests passed for me just fine with 6.6p1 version, but this part didn't see many changes since then so git head should be good too (still running some tests). diff --git a/channels.c b/channels.c index 7ee1f98..2601ad5 100644 --- a/channels.c +++ b/channels.c @@ -1835,7 +1835,7 @@ channel_handle_efd(Channel *c, fd_set *readset, fd_set *writeset) debug2("channel %d: read %d from efd %d", c->self, len, c->efd); if (len < 0 && (errno == EINTR || ((errno == EAGAIN || - errno == EWOULDBLOCK) && !c->detach_close))) + errno == EWOULDBLOCK) && !(c->isatty && c->detach_close)))) return 1; if (len <= 0) { debug2("channel %d: closing read-efd %d",
Damien, any chance to get this fixed upstream? We got confirmation that the patch from last comment works and solves the problem of closing stderr.
Created attachment 3240 [details] Copy close-deferral logic for efd from rfd IMO this is the best way to fix it - this copies the conditions that we use for stdout for stderr. Relative to Jakub's patch, it additionally tests the channel state. Seems to pass the test that JD Paul kindly shared.
This has been applied - thanks again for the detailed analysis and sorry that this slipped my radar for so long. This should be in the OpenSSH 8.0 release.
close bugs that were resolved in OpenSSH 8.5 release cycle
[spam removed]