I was having a problem all weekend where UsePrivilegeSeparation was on, and users were being authenticated through PAM modules. I would continuously get ssh_exchange_identification errors. Generally this is a hosts.allow/hosts.deny problem. However, after running into it 3 times, I determined that was not the cause. The problem has to do with something between sshd and PAM during privilege separation. I was randomly getting several "sshd: <user> [pam]" processes in my "ps ax" list. When the maximum unauthenticated connection limit was reached, no one could log in. Turning privilege separation off seems to remove the problem. It is also important to make sure ssh* binaries are not setuid root in this case. Use SELinux or similar if you feel you need more security. However, I would like privilege separation fixed.
Created attachment 600: Reset thread status
Please try this patch (which has already been committed to -current, auth-pam.c rev 1.97) or try a snapshot.
BTW the only binary that should be setuid is ssh-keysign (and possibly ssh, but only if you use a server that requires connections from low-numbered ports, eg for RSARhosts authentication).
The patch on this bug is in 3.8.1p1, so I think this is fixed. Does the problem still occur with that version?
Created attachment 639: Signal PAM "thread" if SIGCHLD is caused by the privsep slave exiting
Colin Watson pointed out that this may correspond to a Debian bug: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=248125 It appears that what is happening is that the client exits, breaking the TCP connection. When that happens, the privsep slave exits too, which causes a SIGCHLD to be delivered to the monitor. The monitor then attempts to waitpid() on the PAM "thread" which is still alive and blissfully unaware of a problem (because nobody told it to die). That waitpid hangs the monitor's cleanup. The attached patch adds a test for this case to the signal handler to shoot the PAM thread itself if it has to. It is the same as the one I sent to the Debian bug except it resets SIGCHLD to prevent reentering the signal handler when the second process exits.
Comment on attachment 639: Signal PAM "thread" if SIGCHLD is caused by the privsep slave exiting
Looks sane to me.
Thanks, patch id #639 has just been committed (to both HEAD and 3.8.1 branch). William, could you please try either the patch or a snapshot[1] and confirm whether or not the problem is fixed for you? [1] ftp://ftp.openbsd.org/pub/OpenBSD/OpenSSH/portable/snapshot/ or one of its mirrors.
Mario Holbe reports that the patch has been applied to Debian (unstable) and fixes the problem for him. I think this is now fixed, so I'm resolving this bug. If you can reproduce your problem with either a current snapshot or 3.8.1p1 with patch id #639 then please reopen this bug.
There is a bug in the patch: waitpid() with WNOHANG can return 0 if the child is still alive. The corresponding piece of code in sshpam_sigchld_handler() should look like this:

+	int res;
...
+	res = waitpid(cleanup_ctxt->pam_thread, &sshpam_thread_status, WNOHANG);
+	if (res == 0 || res == -1) {
+		/* PAM thread has not exitted, privsep slave must have */
+		kill(cleanup_ctxt->pam_thread, SIGTERM);
+		res = waitpid(cleanup_ctxt->pam_thread, &sshpam_thread_status, 0);
+		if (res == -1)
+			return; /* could not wait */
+	}
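To make the return-value point concrete, here is a small standalone sketch, separate from auth-pam.c. The names child_state, poll_child() and spawn_idle_child() are hypothetical and exist only for this illustration. It spells out the three outcomes of a non-blocking waitpid() on one specific child: 0 while the child is still running, the child's pid once it has been reaped, and -1 on error (for example ECHILD when there is no such child left to wait for).

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum child_state { CHILD_RUNNING, CHILD_REAPED, CHILD_ERROR };

/* Poll one specific child without blocking and name the outcome. */
enum child_state poll_child(pid_t pid, int *status)
{
    pid_t res;

    /* pid 0 or -1 would mean "any child" to waitpid(), so insist on a real pid. */
    assert(pid > 0);

    res = waitpid(pid, status, WNOHANG);
    if (res == 0)
        return CHILD_RUNNING;   /* still alive; nothing was reaped */
    if (res == pid)
        return CHILD_REAPED;    /* *status now holds the wait status */
    return CHILD_ERROR;         /* -1, e.g. ECHILD */
}

/* Hypothetical helper: fork a child that idles until it is signalled. */
pid_t spawn_idle_child(void)
{
    pid_t pid = fork();

    if (pid == 0)
        for (;;)
            pause();
    return pid;                 /* child's pid, or -1 if fork() failed */
}
```

A check that tests only for -1 would take the CHILD_RUNNING case for success and skip the kill(), which appears to be the gap this comment is pointing at.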
This has already been fixed in -current:

20040711
 - (dtucker) [auth-pam.c] Check for zero from waitpid() too, which allows
   the monitor to properly clean up the PAM thread (Debian bug #252676).