Created attachment 1766 [details] ignore SIGHUP across sshd re-exec window In Ubuntu bug #497781, "PierreF" reported that it's sometimes possible to end up with no sshd running if you send it two SIGHUP signals in quick succession, which can sometimes happen due to configuration of networking scripts. Although I haven't been able to reproduce this myself so far, I think this is because SIGHUP is reset to the default action by execve(), and sshd's handler isn't reinstalled until shortly after the exec, so there's a window when it's simply set to the default action of terminating the process. If this hypothesis is correct, which I think is likely, then the attached patch should fix it by ignoring SIGHUP across the exec window.
Seems like a reasonable hypothesis, but I don't see the patch making any difference. The execv will result in an entirely new process address space (including address layout randomization on platforms that have it) and the disposition of the old process' signal handlers will be irrelevant. You'd still have a window until the signal handler is reinstalled where the default action of SIGHUP would kill sshd. You could minimize this window by moving the "signal(SIGCHLD, main_sigchld_handler)" to the start of main(). This wouldn't eliminate the window but it would shrink it a lot (particularly because the generation of the protocol 1 ephemeral server key would no longer be in the window).
That isn't how execve() works, though. Signal dispositions are inherited, not reset by the action of loading the new process image. Lots of things wouldn't work if things were the way you posit - for example, nohup(1) would be entirely non-functional. Here's a transcript of a test demonstrating that ignoring SIGHUP before execve() is effective. Does this convince you? $ cat t.c #include <stdlib.h> #include <stdio.h> #include <unistd.h> #include <signal.h> typedef void (*sighandler_t)(int); extern char **environ; int main(int argc, char **argv) { if (getenv("SECOND_TIME")) { sighandler_t prev = signal(SIGHUP, SIG_DFL); if (prev == SIG_IGN) { printf("SIGHUP was SIG_IGN\n"); } else { printf("SIGHUP was not SIG_IGN\n"); } exit(0); } else { setenv("SECOND_TIME", "1", 1); if (getenv("IGNORE_SIGHUP")) signal(SIGHUP, SIG_IGN); execve(argv[0], argv, environ); } } $ make t cc t.c -o t $ ./t SIGHUP was not SIG_IGN $ IGNORE_SIGHUP=1 ./t SIGHUP was SIG_IGN
BTW, I agree that ASLR makes a difference when the signal handler is a function, but in this case that is not so. SIG_IGN is (typically; certainly on Linux, I imagine on other Unixes too) just a constant and not affected by ASLR.
I notice that the original reporter of https://bugs.launchpad.net/ubuntu/+source/openssh/+bug/497781 confirmed in a comment on that bug that my patch appears to fix the problem for him.
Fair enough, I've never actually thought about how nohup works :-) Adding to the list for 5.4.
Patch applied, thanks. It will be in the 5.4 release.
With the release of 5.4p1, this bug is now considered closed.