ssh hangs on exit after running commands that fork, ie. daemons etc. I've experienced this on SunOS 5.6, 5.7, 5.8; OSF1 V4.0f, V5.1, HP-UX 10.x, 11.x and IRIX64 v6.5. Below is two truss outputs showing and normal exit and a hung exit. ================================================ NORMAL EXIT ================================================ poll(0xFFBEF670, 2, -1) = 1 fd=6 ev=POLLRDNORM rev=0 fd=7 ev=POLLRDNORM rev=POLLRDNORM read(7, "04", 16384) = 1 poll(0xFFBEF670, 2, -1) = 1 fd=6 ev=POLLOUT|POLLRDNORM rev=POLLOUT fd=7 ev=POLLRDNORM rev=0 write(6, "D4 Y ;E2D6D7 L15 O tB3\0".., 48) = 48 poll(0xFFBEF670, 2, -1) = 1 fd=6 ev=POLLRDNORM rev=POLLRDNORM fd=7 ev=POLLRDNORM rev=0 read(6, "D2F980A3 K171ECC 'CF19\f".., 8192) = 96 poll(0xFFBEF670, 3, -1) = 1 fd=6 ev=POLLRDNORM rev=0 fd=7 ev=POLLRDNORM rev=0 fd=8 ev=POLLOUT rev=POLLOUT write(8, " ^ D\b\b l o g o u t\r\n", 12) = 12 poll(0xFFBEF670, 2, -1) = 1 fd=6 ev=POLLRDNORM rev=POLLRDNORM fd=7 ev=POLLRDNORM rev=0 read(6, "ED ,8C QF11BD3 596D0A58B".., 8192) = 32 close(8) = 0 poll(0xFFBEF670, 2, -1) = 1 fd=6 ev=POLLRDNORM rev=POLLRDNORM fd=7 ev=POLLRDNORM rev=0 read(6, "1CD2DC n8C LD6 ]D5EBD806".., 8192) = 96 close(7) = 0 ioctl(0, TCSETSW, 0x000EDB64) = 0 iflag=0022406 oflag=0000005 cflag=03600277 lflag=0105073 cc: 003 034 177 025 004 000 000 000 021 023 032 031 022 017 027 026 000 000 000 close(9) = 0 poll(0xFFBEF670, 1, -1) = 1 fd=6 ev=POLLOUT|POLLRDNORM rev=POLLOUT write(6, " !80 ;E898 ;D6DFFB16 b .".., 32) = 32 sigaction(SIGWINCH, 0xFFBEF6F0, 0xFFBEF770) = 0 new: hand = 0x00000000 mask = 0 0 0 0 flags = 0x0012 old: hand = 0xFF21AE6C mask = 0 0 0 0 flags = 0x0012 write(2, " C o n n e c t i o n t".., 38) = 38 shutdown(6, 2, 1) = 0 close(6) = 0 setuid(100) = 0 door_info(4, 0xFFBEED60) = 0 door_call(4, 0xFFBEED48) = 0 mkdir("/home/swalton/.ssh", 0700) Err#17 EEXIST getpid() = 1779 [1573] getpid() = 1779 [1573] lstat64("/home/swalton/.ssh/prng_seed", 0xFFBEF170) = 0 d=0x0080000F i=124161 m=0100600 l=1 u=100 g=1 sz=1024 at = Dec 18 12:36:10 EST 2001 [ 1008696970 ] mt = Dec 18 12:36:06 EST 2001 [ 1008696966 ] ct = Dec 18 12:36:06 EST 2001 [ 1008696966 ] bsz=8192 blks=2 fs=ufs open64("/home/swalton/.ssh/prng_seed", O_WRONLY|O_CREAT|O_TRUNC, 0600) = 6 write(6, " 8E515A5 y96 QD5F4C8 r19".., 1024) = 1024 close(6) = 0 llseek(0, 0, SEEK_CUR) = 40063 llseek(3, 0, SEEK_CUR) = 1770 _exit(0) =================================== HANG =================================== poll(0xFFBEF670, 2, -1) = 1 fd=6 ev=POLLRDNORM rev=0 fd=7 ev=POLLRDNORM rev=POLLRDNORM read(7, "04", 16384) = 1 poll(0xFFBEF670, 2, -1) = 1 fd=6 ev=POLLOUT|POLLRDNORM rev=POLLOUT fd=7 ev=POLLRDNORM rev=0 write(6, 0x00108A28, 48) = 48 4F8 Z A9A | 7E5A0BB 4DFDDA3C1FB v19 pE2 !A2 t83 bE0\n QBC 6 W ~ |98 -EF\n O\tFDD805 313 CB1 9 5 poll(0xFFBEF670, 2, -1) = 1 fd=6 ev=POLLRDNORM rev=POLLRDNORM fd=7 ev=POLLRDNORM rev=0 read(6, 0xFFBED7A0, 8192) = 48 B8BF05 9 'C5 n `80 I83AA , 686 JD9A3F707CEFA01 j h Y J rA001EA84 K 6 c8706B9ECCA = rB1 , 3 w I poll(0xFFBEF670, 3, -1) = 2 fd=6 ev=POLLRDNORM rev=POLLRDNORM fd=7 ev=POLLRDNORM rev=0 fd=8 ev=POLLOUT rev=POLLOUT write(8, " ^ D\b\b", 4) = 4 read(6, 0xFFBED7A0, 8192) = 48 86AA oCCB08284A1EC9AC6A7BB ^A01196FC14E4E6 \1101ABDF8A & y / & m 9FEBCC -B7FBBEDA z93C09095F4B581 poll(0xFFBEF670, 3, -1) = 1 fd=6 ev=POLLRDNORM rev=0 fd=7 ev=POLLRDNORM rev=0 fd=8 ev=POLLOUT rev=POLLOUT write(8, " l o g o u t\r\n", 8) = 8 poll(0xFFBEF670, 2, -1) = 1 fd=6 ev=POLLRDNORM rev=POLLRDNORM fd=7 ev=POLLRDNORM rev=0 read(6, 0xFFBED7A0, 8192) = 64 BD9988A3 T\n9BEDD1 M96D396 ` : N06 $030F0F\0BF ,D0E295 @FC {8FEE 14DEB7 =0F g Z } q GA9BD8B13 ,18A90FD1CBFA V1AA380B6CDC805E5881E poll(0xFFBEF670, 2, -1) (sleeping...) fd=6 ev=POLLRDNORM rev=POLLRDNORM fd=7 ev=POLLRDNORM rev=0
This is a known issue and has been discussed extensively on the mailings list. There is difference of opinion as to whether this is a bug or whether it is correct behaviour. A good description of the issue comes from a text file in Redhat's source RPM: ------------------------------------ About the hang-on-exit bug: this is the TODO item which shows up when you run "ssh server 'sleep 20 & exit'". Feel free to add the text directly to TODO (I didn't, but only because it's a lot of detail that might not belong). * The shell starts up, and starts its own session. As a side-effect, it gets its own process group. * The child forks off sleep, and because it's in the background, puts it into its own process group. The sleep command inherits a copy of the shell's descriptor for the tty as its stdout. * The shell exits, but doesn't SIGHUP all of its child PIDs like it probably should. * The sshd server attempts to read from the master side of the pty, and while there are still process with the pty open, no EOF is produced. * The sleep command exits, closes its descriptor, sshd detects the EOF, and the connection gets closed. Attempts at fixing this in sshd, and why they don't work: * SIGHUP the sshd's process group. - The shell is in its own process group. * Track process group IDs of all children before we reap them (via an extra field in Session structures which holds the pgid for each child pid), and SIGHUP the pgid when we reap. - Background commands are in yet another process group. * Close the connection when the child dies. - Background commands may need to write data to the connection. Also prematurely truncates output from some commands (scp server, the famous "dd if=/dev/zero bs=1000 count=100" case). Known-good workarounds: * bash: shopt huponexit on * tcsh: none * zsh: setopt HUP (usually the default setting) (taken from email from Jason Stone to openssh-unix-dev, 5 May 2001) * pdksh: ? This appears to affect rsh as well: it behaves the same with 'sleep 20 & exit'.
As has been stated several times on the list - further effort should be aimed at finding out why this "bug" does not manifest on OpenBSD. What aspect of fd semantics differs?
*** Bug 273 has been marked as a duplicate of this bug. ***
(From Jim Knoble) : "Known-good workarounds:" pdksh: set +o nohup (as long as the shell is a login shell)
Created attachment 557 [details] suggested fix (against 3.8p1) I've been running with this patch on various flavours of Linux, IRIX and SunOS since last September.
Comment on attachment 557 [details] suggested fix (against 3.8p1) damn MIME
Created attachment 572 [details] Unpacked attachment
Comment on attachment 572 [details] Unpacked attachment Just going through the diff now. Why these changes: >- if (len < 0 && (errno == EINTR || errno == EAGAIN)) ??
... because EAGAIN is what quite a few SysV flavours return upon read()ing a tty whose controlling process has passed away. Contrast this with OpenBSD that returns EIO in the same situation.
Anything more on this?
I'll look at it after the next minor release, which is happening very soon.
Created attachment 667 [details] Update to -current Regen patch against -current. I'm not entirely sure this patch does the right thing for multiple concurrent sessions, but I need to read it some more. The setting of fds in session_exit_message looks racy to me at least.
Humm. Lately, I've been seeing missed SIGCHLD's (still with 3.6.1p2), particularly when the system running sshd is busy. Is this the race you're refering to? I'll investigate and report back here.
Please leave the bug assigned to the bugs list so can keep track of it.
Created attachment 801 [details] Patch submitted by John Bowman (bowman@math.ualberta.ca) This patch was attached to https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=39128 Also see https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=24293 a regression which happened from another attempt to solve this hang.
1. That doesn't look like a fix - it appears to be a workaround by adding a exit delay. 2. We don't have access to that Redhat regression bug, it is limited to insiders only (why?). Guessing from the bugs filed on either side of it, it looks like it was filed way back in early 2001, so it may have been related to my early and misguided attempt to fix the issue. Those hacks were backed out before the release of openssh-2.5.0p1. Anyone trying to fix this bug should read comment #1, comment #2 and Marc Aurele La France's patches. So far, Marc seems to have struck closest to the root cause of this issue.
Well, I haven't had a lot of time to spend on this, but the hangs I'm still seeing after my patch are intermittent and difficult to duplicate. "High-ish" loads seem to trigger them, after, for example, whatever's been put in the background has had a chance to increase the load average. So there's definitely a timing issue here somewhere. The "child" sshd is left sitting in select() with SIGCHLD blocked after its own child (the shell or whatever) has been reaped. The few forays into the code I've had time to do so far have yet to explain, to me at least, how this can happen. I suspect client_alive_scheduled shouldn't be reset, but I've yet to confirm this, or even whether these remaining hangs are in fact related to the problem at hand here or some other new problem. I also agree adding a configurable delay doesn't solve anything. After all, what's a hang if not a delay? BTW, I reassigned the bug to myself as a remainder that I need to spend time on this. But if don't you want me to do this re-assignment, that's fine too.
Created attachment 1075 [details] Up-to-date hang-on-exit patch This is an up-to-date patch (based on Markus Friedl's suggestion) to fix the notorious hang-on-exit bug (which happens only with the portable version of OpenSSH). No data loss occurs with this patch: it does not break ssh or scp. The latest version of this patch will continue to be posted at the URL below (as it has been for many years now), until the openssh developers finally get around to applying it to their sources: http://www.math.ualberta.ca/imaging/snfs/openssh.html
Comment on attachment 1075 [details] Up-to-date hang-on-exit patch We won't be applying this patch because it wrong, and something we tried quite a few years ago. It is essentially the same as the old, broken hang-on-exit fix that I mistakenly added around ~2.2.x except that it papers over the obvious dataloss race by only chopping off the read fd when it is a tty. It will still lose data on ttys. The following is from a patched sshd, demonstrating the dataloss. The root of this is a race condition: the SIGCHLD can arrive *before* the read pipe is fully empty. > [djm@baragon djm]$ (set -e ; while [ 1 ] ; do > ssh -qttp2222 linux-qemu "dd if=/dev/zero bs=256k count=1" | > wc -c ; done) > 262142 > 262093 > 262096 > 262125 > 262096 > 260056 > 262116 > 262124 > 212974 > ^C[djm@baragon djm]$ I recommend that packegers of OpenSSH do *not* apply this patch.
> It is essentially the same as the old, broken hang-on-exit fix that I > mistakenly added around ~2.2.x except that it papers over the obvious dataloss > race by only chopping off the read fd when it is a tty. It will still lose data > on ttys. That is in fact is the correct behaviour. Please check out how rsh and the commercial ssh work: if the user types "exit" in a shell, all further I/O, both reads and writes, are ignored. The rationale for this long-standing convention is: if you want more I/O, don't type "exit"! One sensible use of this feature is to start up a program in the background (which tees its output both to a log file and stdout) and then exit the shell after viewing the initial output. But there are many other uses as well. The current behaviour of openssh is annoying and unacceptable as a drop-in replacement for rsh. > (set -e ; while [ 1 ] ; do > ssh -qttp2222 linux-qemu "dd if=/dev/zero bs=256k count=1" | > wc -c ; done) Because you tried to be clever and pretend to be an interactive tty session, even though you are not. You want to say this instead: (set -e ; while [ 1 ] ; do ssh -qp2222 linux-qemu "dd if=/dev/zero bs=256k count=1" | wc -c ; done) There is no data loss with this command (or even if a single -t is given, because ssh sensibly overrides this). So the question is, why do you need to pretend to require a tty? If there really is some sensible reason for forcing a tty allocation in this context, then just pass a flag from ssh to sshd to indicate that you have done so. It's a trivial modification, but I don't yet see the need. > I recommend that packegers of OpenSSH do *not* apply this patch. I highly recommend it; it's about time that this silly bug be fixed once and for all. Let's move forward now... Incidentally, to avoid confusion, please disregard the claim in Comment #16 that an earlier version of my hang-on-exit patch simply added an exit delay. The exit delay business was a separate option that was added at the request of another user. It has nothing at all to do with the hang-on-exit bug. Since no else seems to have been interested in that feature, I stopped maintaining that exit delay patch some time ago (although it would be easy for someone to port my old openssh-sleep.patch to the current version).
(In reply to comment #20) > That is in fact is the correct behaviour. Please check out how rsh and the > commercial ssh work: if the user types "exit" in a shell, all further I/O, > both reads and writes, are ignored. The rationale for this long-standing > convention is: if you want more I/O, don't type "exit"! The example test in Comment #19 *doesn't* do any IO after "dd" exits, but there is still data in the pipe/socket buffer that gets lost by your patch. The same applies to any interactive program that is producing data around the time it exits. > > (set -e ; while [ 1 ] ; do > > ssh -qttp2222 linux-qemu "dd if=/dev/zero bs=256k count=1" | > > wc -c ; done) > > Because you tried to be clever and pretend to be an interactive tty > session, even though you are not. You are completely missing the point. This is simple test to demonstrate that your patch loses data on ttys. A real-world mainfestation of this problem with your patch could be an interactive ncurses app's endwin() cleanups being truncated, resulting in a corrupted terminal state. > I highly recommend it; it's about time that this silly bug be fixed once > and for all. Let's move forward now... I'd love to see this bug fixed, but your patch introduces new problems. Just because you haven't noticed, or consider them relevant to you them doesn't mean that they don't exist.
Letting this bug languish for years without a solution isn't helping us (the users). We care about real-world use, and in real world use a hanging process that I've asked to exit is an administration burden. Auto-deploy/maintenance scripts can't be run automatically without writing some sort of shootdown watchdog, etc. This patch solves a real need. If you don't like the way this patch works, then by all means create an alternate solution, but let's get the problem fixed already. If you must, create an option to make this behavior selectable or better yet an environment variable. Something that I can set and just leave on all the time. It's not one author vs. openssh, there are easily thousands of ssh users that want this fixed (hell I've just talked with 8 of them and everyone is shocked to hear this is even a debate). We'll accept loosing the last few bytes if we can guarantee a disconnect. Please help us. On behalf of the lowly users, Sam (In reply to comment #21) > (In reply to comment #20) > > That is in fact is the correct behaviour. Please check out how rsh and the > > commercial ssh work: if the user types "exit" in a shell, all further I/O, > > both reads and writes, are ignored. The rationale for this long-standing > > convention is: if you want more I/O, don't type "exit"! > > The example test in Comment #19 *doesn't* do any IO after "dd" exits, but there > is still data in the pipe/socket buffer that gets lost by your patch. The same > applies to any interactive program that is producing data around the time it > exits. > > > > (set -e ; while [ 1 ] ; do > > > ssh -qttp2222 linux-qemu "dd if=/dev/zero bs=256k count=1" | > > > wc -c ; done) > > > > Because you tried to be clever and pretend to be an interactive tty > > session, even though you are not. > > You are completely missing the point. This is simple test to demonstrate that > your patch loses data on ttys. > > A real-world mainfestation of this problem with your patch could be an > interactive ncurses app's endwin() cleanups being truncated, resulting in a > corrupted terminal state. > > > I highly recommend it; it's about time that this silly bug be fixed once > > and for all. Let's move forward now... > > I'd love to see this bug fixed, but your patch introduces new problems. Just > because you haven't noticed, or consider them relevant to you them doesn't mean > that they don't exist. >
Created attachment 1098 [details] Updated patch against 4.3p2 Well, one effective way for a user to become "lowly" is to believe one is in the position to dictate. Often, in such cases, there's money involved. Attached is a rework of my prior fix, for 4.3p2 and to fix a few bugs I've run into since then, including the race condition mentioned earlier. This works with all protocol versions and there is no data loss. This version of the patch turns out to be quite small but you might wish to recode its overloading of detach_close. The intent here is to continue reading from the tty, after the child shell has terminated, until there is an error. On OpenBSD (and I suspect other CSRG-based variants), that error will be EIO, whereas on Linux, IRIX, SunOS, etc., it will be EAGAIN. That's because SysV variants do not "close" the tty when its controlling process exits. The reason the hang doesn't occur on OpenBSD is that the tty's change of status is reported through select(2), but isn't on SysV variants. Under compat20, this only affects the channel associated with the tty. Thus, if other channels are open (forwarding, etc.), the connection with the client will remain open. Damien, that `dd` command should redirect its stderr to /dev/null so you will always get 262144. A nitpick to be sure, but your posting was just the thing I needed to duplicate the problem reliably enough to debug it. Thanks.
(In reply to comment #24) > Anything more on this? I tried the latest version of patch and it fixes the hang-bug on Solaris 9 (SPARC) and SLES 9, but not on Tru64 UNIX 5.1A, Tru64 UNIX 5.1B or HP-UX 11.11. For the record, the test was this simple command; ssh server 'sleep 20 & exit'.
(In reply to comment #25) > (In reply to comment #24) > > Anything more on this? > > I tried the latest version of patch and it fixes the hang-bug on > Solaris 9 (SPARC) and SLES 9, but not on Tru64 UNIX 5.1A, Tru64 UNIX > 5.1B or HP-UX 11.11. > > For the record, the test was this simple command; > ssh server 'sleep 20 & exit'. I was a bit quick on the reply it seems. The Tru64 UNIX and HP-UX machines I tested just had a high load and it took some time (about 15 seconds) for SSH to actually connect, so the patch also works for Tru64 UNIX 5.1A and 5.1B as well as HP-UX 11.11.
Created attachment 1214 [details] Update Marc's patch to 4.5p1 Some changes in nearby code prevent the patch from applying cleanly to 4.5p1.
Created attachment 1215 [details] Updated Patch for 4.5p1 from Darren Tucker (corrected minor typo) I've applied the mentioned patch for openssh 4.5p1 the patch itself contains a minor typo (closing bracket) for line 1455 which is corrected in this version of the patch. The patch fixes the issue with "ssh -f ..." on AIX (tested&verified on AIX 5.3 ML 4) thanks Darren for your quick response to this problem
(In reply to comment #28) > thanks Darren for your quick response to this problem You're welcome but Marc deserves the credit for figuring it out, I just pointed you in the right direction.
I'm glad to see this change integrated. This report should be resolved as fixed, as I, for one, now consider this issue to have been corrected. Thanks.
Comment on attachment 1098 [details] Updated patch against 4.3p2 The patch has not been applied, I simply pointed someone toward it. >- FD_ISSET(c->rfd, readset)) { >+ (c->detach_close || FD_ISSET(c->rfd, readset))) { > len = read(c->rfd, buf, sizeof(buf)); >- if (len < 0 && (errno == EINTR || errno == EAGAIN)) >+ if (len < 0 && errno == EINTR) I'm wondering if this could ever cause the channel to be shutdown if the read fails with EAGAIN during normal operation. POSIX says EAGAIN should be returned when the descriptor is set for nonblocking IO and select should not mark the fd in that case so this shouldn't happen. On the other hand, comment #9 has one counterexample, plus Google finds http://archives.neohapsis.com/archives/postfix/2003-09/0507.html, which indicates that select can fire on on a non-ready fd (probably not an issue here since it appears to affect only TCP sockets) but it makes me wonder what other instances are out there... [...] >+ program_alive_scheduled = child_terminated; Any reason this is a separate variable rather than just using child_terminated? A possible race where a SIGCHLD arrives after the read? On a related note, I'm also wondering if we be aiming to carry this as a portable-only change (since OpenBSD isn't directly affected and we already have to deal with some pty wackiness from AIX in this area of code, so patches won't apply cleanly already) or should we be aiming to push the change to OpenBSD too?
(In reply to comment #31) > (From update of attachment 1098 [details]) > The patch has not been applied, I simply pointed someone toward it. > >- FD_ISSET(c->rfd, readset)) { > >+ (c->detach_close || FD_ISSET(c->rfd, readset))) { > > len = read(c->rfd, buf, sizeof(buf)); > >- if (len < 0 && (errno == EINTR || errno == EAGAIN)) > >+ if (len < 0 && errno == EINTR) > I'm wondering if this could ever cause the channel to be shutdown if > the read fails with EAGAIN during normal operation. > POSIX says EAGAIN should be returned when the descriptor is set for > nonblocking IO and select should not mark the fd in that case so this > shouldn't happen. > On the other hand, comment #9 has one counterexample, How so? > plus Google finds > http://archives.neohapsis.com/archives/postfix/2003-09/0507.html, which > indicates that select can fire on on a non-ready fd (probably not an > issue here since it appears to affect only TCP sockets) but it makes me > wonder what other instances are out there... Well, you could accept EAGAIN only before detach_close is set. In the !compat20 case, child_terminated. > >+ program_alive_scheduled = child_terminated; > Any reason this is a separate variable rather than just using > child_terminated? A possible race where a SIGCHLD arrives after the > read? It's been a while, but IIRC, I did this to ensure I was looking at the same child_terminated value across the select(2). I wanted to force read(2)'s only if I also forced select(2) to check the fd(s). Even so, I wish to emphasise here that it is not possible to miss the child's death: In the compat20 case, detach_close ends up taking over from child_terminated. I.e. once set, detach_close is permanent, but collect_children() resets child_terminated. In the !compat20 case, child_terminated isn't reset. > On a related note, I'm also wondering if we be aiming to carry this as > a portable-only change (since OpenBSD isn't directly affected and we > already have to deal with some pty wackiness from AIX in this area of > code, so patches won't apply cleanly already) or should we be aiming to > push the change to OpenBSD too? That's a decision only you can make. My personal view is that the portable version should also run on OpenBSD.
Comment on attachment 1215 [details] Updated Patch for 4.5p1 from Darren Tucker (corrected minor typo) Sorry for taking so long to review this. Some comments on the diff (I'd really like to see this in 4.6p1) >Index: channels.c >=================================================================== >RCS file: /usr/local/src/security/openssh/cvs/openssh/channels.c,v >retrieving revision 1.248 >diff -u -p -r1.248 channels.c >--- channels.c 30 Aug 2006 01:07:40 -0000 1.248 >+++ channels.c 27 Nov 2006 03:39:26 -0000 >@@ -1449,10 +1449,10 @@ channel_handle_rfd(Channel *c, fd_set *r > int len; > > if (c->rfd != -1 && >- FD_ISSET(c->rfd, readset)) { >+ (c->detach_close || FD_ISSET(c->rfd, readset))) { Is c->detach_close used here to chose "session channels that have exitied but not yet closed"? or something else? > errno = 0; > len = read(c->rfd, buf, sizeof(buf)); >- if (len < 0 && (errno == EINTR || errno == EAGAIN)) >+ if (len < 0 && errno == EINTR) Can the second condition instead be made: && (errno == EINTR || c->detach_close || errno == EAGAIN) This would limit the scope of the change to only session channels which have already received a SIGCHLD. >@@ -1604,11 +1604,11 @@ channel_handle_efd(Channel *c, fd_set *r > c->local_consumed += len; > } > } else if (c->extended_usage == CHAN_EXTENDED_READ && >- FD_ISSET(c->efd, readset)) { >+ (c->detach_close || FD_ISSET(c->efd, readset))) { > len = read(c->efd, buf, sizeof(buf)); > debug2("channel %d: read %d from efd %d", > c->self, len, c->efd); >- if (len < 0 && (errno == EINTR || errno == EAGAIN)) >+ if (len < 0 && errno == EINTR) Likewise. >Index: serverloop.c >=================================================================== >RCS file: /usr/local/src/security/openssh/cvs/openssh/serverloop.c,v >retrieving revision 1.150 >diff -u -p -r1.150 serverloop.c >--- serverloop.c 23 Oct 2006 17:02:41 -0000 1.150 >+++ serverloop.c 27 Nov 2006 03:37:16 -0000 >@@ -280,6 +280,7 @@ wait_until_can_do_something(fd_set **rea > struct timeval tv, *tvp; > int ret; > int client_alive_scheduled = 0; >+ int program_alive_scheduled = 0; > > /* > * if using client_alive, set the max timeout accordingly, >@@ -317,6 +318,7 @@ wait_until_can_do_something(fd_set **rea > * the client, try to get some more data from the program. > */ > if (packet_not_very_much_data_to_write()) { >+ program_alive_scheduled = child_terminated; Why not just use child_terminated directly here? Is it because of children exiting during select() >- } else if (ret == 0 && client_alive_scheduled) >- client_alive_check(); >+ } else { >+ if (ret == 0 && client_alive_scheduled) >+ client_alive_check(); >+ if (program_alive_scheduled && fdin_is_tty) { >+ if (!fdout_eof) >+ FD_SET(fdout, *readsetp); >+ if (!fderr_eof) >+ FD_SET(fderr, *readsetp); >+ } >+ } I think the condition for this if() should include !compat20 (I know that fdin_is_tty is never set for !compat20, but you have to hunt for it). > > notify_done(*readsetp); > } >@@ -407,7 +417,7 @@ process_input(fd_set *readset) > if (!fdout_eof && FD_ISSET(fdout, readset)) { > errno = 0; > len = read(fdout, buf, sizeof(buf)); >- if (len < 0 && (errno == EINTR || errno == EAGAIN)) { >+ if (len < 0 && errno == EINTR) { > /* do nothing */ > #ifndef PTY_ZEROREAD > } else if (len <= 0) { >@@ -425,7 +435,7 @@ process_input(fd_set *readset) > if (!fderr_eof && FD_ISSET(fderr, readset)) { > errno = 0; > len = read(fderr, buf, sizeof(buf)); >- if (len < 0 && (errno == EINTR || errno == EAGAIN)) { >+ if (len < 0 && errno == EINTR) { similar comment as for channels.c bit: can we use program_terminated here instead of c->detach close?
(In reply to comment #33) > > errno = 0; > > len = read(c->rfd, buf, sizeof(buf)); > >- if (len < 0 && (errno == EINTR || errno == EAGAIN)) > >+ if (len < 0 && errno == EINTR) > > Can the second condition instead be made: > > && (errno == EINTR || c->detach_close || errno == EAGAIN) meh, I meant: && (errno == EINTR || (errno == EAGAIN && !c->detach_close))
Created attachment 1227 [details] patch against current 20070121 This patch makes the changes I mentioned in the previous comment, and turns on the immediate close behaviour for ttys only when protocol 2 is in use (it was on for ttys only for protocol 1 already). Testcases for checking tty vs. non-tty behaviour: 1. ssh -1tt host "echo go; (sleep 10; echo done) & exit" 2. ssh -2tt host "echo go; (sleep 10; echo done) & exit" TTY case: should see "go" and an immediate exit 3. ssh -1 host "echo go; (sleep 10; echo done) & exit" 4. ssh -2 host "echo go; (sleep 10; echo done) & exit" No TTY case: should see "go" and then "done" after a 10 second delay.
Comment on attachment 1227 [details] patch against current 20070121 attachment 1227 [details] tests OK on Linux and OpenSolaris, passing all regress tests and the tests mentioned in comment 19 and comment 35
(In reply to comment #33) > (From update of attachment 1215 [details]) > Sorry for taking so long to review this. Some comments on the diff (I'd > really like to see this in 4.6p1) > >Index: channels.c > >=================================================================== > >RCS file: /usr/local/src/security/openssh/cvs/openssh/channels.c,v > >retrieving revision 1.248 > >diff -u -p -r1.248 channels.c > >--- channels.c 30 Aug 2006 01:07:40 -0000 1.248 > >+++ channels.c 27 Nov 2006 03:39:26 -0000 > >@@ -1449,10 +1449,10 @@ channel_handle_rfd(Channel *c, fd_set *r > > int len; > > > > if (c->rfd != -1 && > >- FD_ISSET(c->rfd, readset)) { > >+ (c->detach_close || FD_ISSET(c->rfd, readset))) { > Is c->detach_close used here to chose "session channels that have > exitied but not yet closed"? or something else? Yes. > > errno = 0; > > len = read(c->rfd, buf, sizeof(buf)); > >- if (len < 0 && (errno == EINTR || errno == EAGAIN)) > >+ if (len < 0 && errno == EINTR) > Can the second condition instead be made: > && (errno == EINTR || c->detach_close || errno == EAGAIN) Yes (with your correction from comment #34). > This would limit the scope of the change to only session channels which > have already received a SIGCHLD. > >@@ -1604,11 +1604,11 @@ channel_handle_efd(Channel *c, fd_set *r > > c->local_consumed += len; > > } > > } else if (c->extended_usage == CHAN_EXTENDED_READ && > >- FD_ISSET(c->efd, readset)) { > >+ (c->detach_close || FD_ISSET(c->efd, readset))) { > > len = read(c->efd, buf, sizeof(buf)); > > debug2("channel %d: read %d from efd %d", > > c->self, len, c->efd); > >- if (len < 0 && (errno == EINTR || errno == EAGAIN)) > >+ if (len < 0 && errno == EINTR) > Likewise. > >Index: serverloop.c > >=================================================================== > >RCS file: /usr/local/src/security/openssh/cvs/openssh/serverloop.c,v > >retrieving revision 1.150 > >diff -u -p -r1.150 serverloop.c > >--- serverloop.c 23 Oct 2006 17:02:41 -0000 1.150 > >+++ serverloop.c 27 Nov 2006 03:37:16 -0000 > >@@ -280,6 +280,7 @@ wait_until_can_do_something(fd_set **rea > > struct timeval tv, *tvp; > > int ret; > > int client_alive_scheduled = 0; > >+ int program_alive_scheduled = 0; > > > > /* > > * if using client_alive, set the max timeout accordingly, > >@@ -317,6 +318,7 @@ wait_until_can_do_something(fd_set **rea > > * the client, try to get some more data from the program. > > */ > > if (packet_not_very_much_data_to_write()) { > >+ program_alive_scheduled = child_terminated; > Why not just use child_terminated directly here? Is it because of > children exiting during select() Yes, as I infer in comment #32. > >- } else if (ret == 0 && client_alive_scheduled) > >- client_alive_check(); > >+ } else { > >+ if (ret == 0 && client_alive_scheduled) > >+ client_alive_check(); > >+ if (program_alive_scheduled && fdin_is_tty) { > >+ if (!fdout_eof) > >+ FD_SET(fdout, *readsetp); > >+ if (!fderr_eof) > >+ FD_SET(fderr, *readsetp); > >+ } > >+ } > I think the condition for this if() should include !compat20 (I know > that fdin_is_tty is never set for !compat20, but you have to hunt for > it). Ok. > > > > notify_done(*readsetp); > > } > >@@ -407,7 +417,7 @@ process_input(fd_set *readset) > > if (!fdout_eof && FD_ISSET(fdout, readset)) { > > errno = 0; > > len = read(fdout, buf, sizeof(buf)); > >- if (len < 0 && (errno == EINTR || errno == EAGAIN)) { > >+ if (len < 0 && errno == EINTR) { > > /* do nothing */ > > #ifndef PTY_ZEROREAD > > } else if (len <= 0) { > >@@ -425,7 +435,7 @@ process_input(fd_set *readset) > > if (!fderr_eof && FD_ISSET(fderr, readset)) { > > errno = 0; > > len = read(fderr, buf, sizeof(buf)); > >- if (len < 0 && (errno == EINTR || errno == EAGAIN)) { > >+ if (len < 0 && errno == EINTR) { > similar comment as for channels.c bit: can we use program_terminated > here instead of c->detach close? Yes, but 's/program_terminated/child_terminated/' as you have it in attachment #1227 [details].
Comment on attachment 1227 [details] patch against current 20070121 This looks good to me.
Comment on attachment 1227 [details] patch against current 20070121 Is this intended to be a Portable-only change or is it intended to be fed back to OpenBSD too?
(In reply to comment #39) > Is this intended to be a Portable-only change or is it intended to be > fed back to OpenBSD too? I think it should be portable-only. OpenBSD has never "hung" at exit, and the patch brings other systems into line with the OpenBSD behaviour. So there is no need for the patch there, and only risk (albeit minor).
Comment on attachment 1227 [details] patch against current 20070121 I have tested this on pretty much everything I have (FC6, OpenBSD 4.0, NetBSD 3.1, AIX 4+5, Solaris 2.5+8, HP-UX 11). I ran both the normal regress tests and Damien's test case and both passed. I did note that "ssh server 'sleep 20 & exit'" still hangs but I think that's supposed to since the descriptor is still open. Redirecting stderr and stdout lets the connection close right away.
(In reply to comment #41) > (From update of attachment 1227 [details]) > I have tested this on pretty much everything I have (FC6, OpenBSD 4.0, > NetBSD 3.1, AIX 4+5, Solaris 2.5+8, HP-UX 11). I ran both the normal > regress tests and Damien's test case and both passed. > I did note that "ssh server 'sleep 20 & exit'" still hangs but I think > that's supposed to since the descriptor is still open. Redirecting > stderr and stdout lets the connection close right away. Yes, that's correct. One of my goals for this change was to make portable sshd's behaviour in this respect match OpenBSD sshd's.
This patch has been committed and will be in OpenSSH-4.6p1. Thanks to Marc Aurele La France for the great work in tracking this down and fixing it!
I am under the impression that the latest patch is not fixing the problem in HP-UX 11.11. I run the following to reproduce the problem. # cat test1.ksh #!/usr/bin/ksh while (( 1 )) do sleep 10 done # cat test2.ksh #!/usr/bin/ksh ./test1.ksh &
(In reply to comment #44) > #!/usr/bin/ksh > ./test1.ksh & This leaves std{in,out,err} connected. Try redirecting them to /dev/null instead: ./test1/ksh </dev/null >/dev/null 2>&1
Yes, but that workaround doesn't need the patch. However, with latest patch, there is no hang when I run test2.ksh and exit from the shell on linux. But it is not the case with HP-UX. So Im wondering what the patch is doing with respect to HP-UX.
(In reply to comment #46) > Yes, but that workaround doesn't need the patch. However, with latest > patch, there is no hang when I run test2.ksh and exit from the shell on > linux. But it is not the case with HP-UX. So Im wondering what the > patch is doing with respect to HP-UX. This is at odds with comment #26.
(In reply to comment #47) > (In reply to comment #46) > > Yes, but that workaround doesn't need the patch. However, with latest > > patch, there is no hang when I run test2.ksh and exit from the shell on > > linux. But it is not the case with HP-UX. So Im wondering what the > > patch is doing with respect to HP-UX. > > This is at odds with comment #26. > I checked the testcases described in <A href="http://bugzilla.mindrot.org/show_bug.cgi?id=52#c35">comment 35</a> and they worked as described there for both HP-UX and Linux. But HP-UX behaves differently only when I login and launch my script described in <A href="http://bugzilla.mindrot.org/show_bug.cgi?id=52#c44">comment 44</a>. So I think there is something special that goes in HP-UX at login which is not happening when I just run with -t option.
(In reply to comment #48) > (In reply to comment #47) > > (In reply to comment #46) > > > Yes, but that workaround doesn't need the patch. However, with > > > latest patch, there is no hang when I run test2.ksh and exit from > > > the shell on linux. But it is not the case with HP-UX. So Im > > > wondering what the patch is doing with respect to HP-UX. > > This is at odds with comment #26. > I checked the testcases described in <A > href="http://bugzilla.mindrot.org/show_bug.cgi?id=52#c35">comment > 35</a> and they worked as described there for both HP-UX and Linux. > But HP-UX behaves differently only when I login and launch my script > described in <A > href="http://bugzilla.mindrot.org/show_bug.cgi?id=52#c44">comment > 44</a>. So I think there is something special that goes in HP-UX at > login which is not happening when I just run with -t option. OK. So, what happens if you do ... ./test2.ksh & exit ... when logged in?
(In reply to comment #49) > (In reply to comment #48) > > (In reply to comment #47) > > > (In reply to comment #46) > > > > Yes, but that workaround doesn't need the patch. However, with > > > > latest patch, there is no hang when I run test2.ksh and exit from > > > > the shell on linux. But it is not the case with HP-UX. So Im > > > > wondering what the patch is doing with respect to HP-UX. > > > > This is at odds with comment #26. > > > I checked the testcases described in <A > > href="http://bugzilla.mindrot.org/show_bug.cgi?id=52#c35">comment > > 35</a> and they worked as described there for both HP-UX and Linux. > > But HP-UX behaves differently only when I login and launch my script > > described in <A > > href="http://bugzilla.mindrot.org/show_bug.cgi?id=52#c44">comment > > 44</a>. So I think there is something special that goes in HP-UX at > > login which is not happening when I just run with -t option. > > OK. So, what happens if you do ... > > ./test2.ksh & > exit > > ... when logged in? > Sorry for the embedded tags in my last comment. To answer the question, $ ./test2.ksh & [1] 2659 $ exit There are running jobs. [1] + Done ./test2.ksh & $ exit logout <It hangs after this> This happens irrespective of the patch in HP-UX. I'm using korn shell. Not sure is this something shell specific. I guess it can't be, because if I do the same steps after login through telnet to a HP-UX box, the connection get closed.
sigh. five comments and no sign of a debug trace or a description of your configuration (anything non default?).
Created attachment 1242 [details] Debug output on sshd The debug output on sshd is attached here. I'm using OpenSSH 4.5 and the configure options are below, ./configure --prefix=<foo> --with-zlib=<foo> --with-tcp-wrappers=<foo> As far as client side is concerned, its normal debug output and as soon as I gave exit the following happens on client side, $ ./test2.ksh & [1] 18353 $ exit There are running jobs. [1] + Done ./test2.ksh & $ exit logout debug1: client_input_channel_req: channel 0 rtype exit-status reply 0 I will upload the ssh client side debug output if you feel it is needed.
(In reply to comment #51) > sigh. five comments and no sign of a debug trace or a description of > your configuration (anything non default?). > No, I just used the default sshd_config and ssh_config for this.
(In reply to comment #52) > Created an attachment (id=1242) [details] > Debug output on sshd > The debug output on sshd is attached here. I'm using OpenSSH 4.5 and > the configure options are below, > ./configure --prefix=<foo> --with-zlib=<foo> --with-tcp-wrappers=<foo> > As far as client side is concerned, its normal debug output and as soon > as I gave exit the following happens on client side, > $ ./test2.ksh & > [1] 18353 > $ exit > There are running jobs. > [1] + Done ./test2.ksh & > $ exit > logout > debug1: client_input_channel_req: channel 0 rtype exit-status reply 0 > I will upload the ssh client side debug output if you feel it is > needed. Does it behave differently with protocol version 1, i.e. `ssh -1`?
Created attachment 1243 [details] SSHD debug output for protocol 1 No difference between protocol 1 and 2. They simply hang. The client side for protocol 1 is shown below, $ ./test2.ksh & [1] 3428 $ There are running jobs. [1] + Done ./test2.ksh & $ logout
(In reply to comment #55) > Created an attachment (id=1243) [details] > SSHD debug output for protocol 1 > No difference between protocol 1 and 2. They simply hang. The client > side for protocol 1 is shown below, > $ ./test2.ksh & > [1] 3428 > $ There are running jobs. > [1] + Done ./test2.ksh & > $ > logout OK, then, on HP-UX, is EAGAIN != EWOULDBLOCK?
Close resolved bugs after release.
(In reply to comment #57) > Close resolved bugs after release. i think this problem is still exists in hp-ux. if we use "ssh server 'sleep 50 & exit'", the server will hang there 50 secs and then exit, that is to say, it exits until sleep 50 secs. and when i use an infinite loop and make it run background, and then use exit to logout, ssh will still hang there. i debug the code, and found in channel_garbage_collect(channels.c), chan_is_dead is return 1 for c->isatate and c->ostate equals CHAN_INPUT_OPEN. so the channel remains open. Can someone help me check this problem. the script i use is just this: #! /sbin/sh while (( 1 )) do sleep 10 >/dev/null 2>&1 done