On Debian Linux (sarge) with kernel 2.6.9, a non-privileged thread of sshd hangs when the executed command returns. Not every request hangs, but a lot do:

 4683 ?        Ss     0:00 /usr/sbin/sshd
 6295 ?        Ss     0:00  \_ sshd: root@notty
 6297 ?        Zs     0:00  |   \_ [check_dpt] <defunct>
 8000 ?        Ss     0:00  \_ sshd: root@notty
 8002 ?        Zs     0:00  |   \_ [check_dpt] <defunct>
 8048 ?        Ss     0:00  \_ sshd: root@notty
 8050 ?        Zs     0:00  |   \_ [check_dpt] <defunct>
 8063 ?        Ss     0:00  \_ sshd: root@notty
 8065 ?        Zs     0:00  |   \_ [check_dpt] <defunct>
 8078 ?        Ss     0:00  \_ sshd: root@notty
 8080 ?        Zs     0:00  |   \_ [check_dpt] <defunct>
 8098 ?        Ss     0:00  \_ sshd: root@notty
 8100 ?        Zs     0:00      \_ [check_dpt] <defunct>

Dimitrij
Is this built with PAM? Does the problem occur with 3.9p1? There was a bug that was fixed in 3.9p1 relating to the handling of SIGCHLD in the PAM code which could possibly be the cause of this.
It is the standard Debian sarge (testing) sshd and was built with PAM. It is 3.4p1.
3.4p1 is 2.5 years old. Please try to reproduce this problem with 3.9p1, or you can take the bug up with your OS vendor if they refuse to provide a non-ancient version.
Why does 'UsePAM no' in sshd_config not solve this problem?
The UsePAM sshd_config(5) directive was introduced in 3.7p1. Prior to that, it was a compile-time directive only (configure --with-pam). That said, the bug I was referring to was introduced around 3.8ish, so it can't be the cause of your problem. Can you reproduce the problem with 3.9p1? If not, please close this bug and report it to Debian.
This problem still exists with 3.9p1-1 from debian/experimental too.

strace:

Process 11107 attached - interrupt to quit
futex(0xb7e573cc, FUTEX_WAIT, 2, NULL^X <unfinished ...>

gdb:

balancedev2:~# gdb -p 11107
GNU gdb 6.3-debian
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-linux".
Attaching to process 11107
Using host libthread_db library "/lib/tls/libthread_db.so.1".
warning: could not load vsyscall page because no executable was specified
warning: try using the "file" command first
Reading symbols from /usr/sbin/sshd...(no debugging symbols found)...done.
Reading symbols from /lib/libwrap.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib/libwrap.so.0
Reading symbols from /lib/libpam.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib/libpam.so.0
Reading symbols from /lib/tls/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libdl.so.2
Reading symbols from /lib/tls/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libresolv.so.2
Reading symbols from /usr/lib/i686/cmov/libcrypto.so.0.9.7...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/i686/cmov/libcrypto.so.0.9.7
Reading symbols from /lib/tls/libutil.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libutil.so.1
Reading symbols from /usr/lib/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /lib/tls/libnsl.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libnsl.so.1
Reading symbols from /lib/tls/libcrypt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libcrypt.so.1
Reading symbols from /lib/tls/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
[New Thread -1210984832 (LWP 11107)]
Loaded symbols for /lib/tls/libpthread.so.0
Reading symbols from /lib/tls/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libc.so.6
Reading symbols from /lib/ld-linux.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/tls/libnss_compat.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libnss_compat.so.2
Reading symbols from /lib/tls/libnss_nis.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libnss_nis.so.2
Reading symbols from /lib/tls/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libnss_files.so.2
Reading symbols from /lib/tls/libnss_dns.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libnss_dns.so.2
0xb7e077a1 in pthread_setcanceltype () from /lib/tls/libc.so.6
(gdb) bt
#0  0xb7e077a1 in pthread_setcanceltype () from /lib/tls/libc.so.6
#1  0x00000001 in ?? ()
#2  0xb7e55fcc in ?? () from /lib/tls/libc.so.6
#3  0x00000001 in ?? ()
#4  0xb7df7523 in setlogmask () from /lib/tls/libc.so.6
#5  0xbfffdd74 in ?? ()
#6  0xb7df74a0 in setlogmask () from /lib/tls/libc.so.6
#7  0x00000000 in ?? ()
#8  0x00000001 in ?? ()
#9  0x00000000 in ?? ()
#10 0xbfffe174 in ?? ()
#11 0xbfffdd74 in ?? ()
#12 0x00000007 in ?? ()
#13 0xbfffe58c in ?? ()
#14 0x0806f3ac in error ()
#15 0x0806f150 in error ()
#16 0x08053301 in ?? ()
#17 0x0807ff45 in _IO_stdin_used ()
#18 0xb7e46d5b in in6addr_loopback () from /lib/tls/libc.so.6
#19 0xbfffea6c in ?? ()
#20 0xbfffe63c in ?? ()
#21 0x00000009 in ?? ()
#22 0x00000000 in ?? ()
---Type <return> to continue, or q <return> to quit---
#23 0x00000008 in ?? ()
#24 <signal handler called>
#25 0xb7dfba1c in send () from /lib/tls/libc.so.6
#26 0xb7df6e10 in vsyslog () from /lib/tls/libc.so.6
#27 0xb7df6aaf in syslog () from /lib/tls/libc.so.6
#28 0x0806f3c1 in error ()
#29 0x0806f150 in error ()
#30 0x08054e27 in ?? ()
#31 0x080800d4 in _IO_stdin_used ()
#32 0x00000005 in ?? ()
#33 0xbffff358 in ?? ()
#34 0x08053d5f in ?? ()
#35 0x0808f334 in stdin ()
#36 0x080532e0 in ?? ()
#37 0x00000000 in ?? ()
#38 0x00000000 in ?? ()
#39 0x08092c20 in stdin ()
#40 0x00000000 in ?? ()
#41 0xbffff358 in ?? ()
#42 0x00000000 in ?? ()
#43 0x00000000 in ?? ()
#44 0x00000000 in ?? ()
#45 0x00000000 in ?? ()
---Type <return> to continue, or q <return> to quit---
#46 0x00002b65 in ?? ()
#47 0x08092c20 in stdin ()
#48 0xbffff380 in ?? ()
#49 0xbffff398 in ?? ()
#50 0x0805793a in ?? ()
#51 0x00002b65 in ?? ()
#52 0x00000005 in ?? ()
#53 0x00000005 in ?? ()
#54 0x00000007 in ?? ()
#55 0x080965c0 in ?? ()
#56 0x0809bc62 in ?? ()
#57 0x00000006 in ?? ()
#58 0x00000007 in ?? ()
#59 0x00000004 in ?? ()
#60 0x00000005 in ?? ()
#61 0xbffff3a8 in ?? ()
#62 0x080965c0 in ?? ()
#63 0x08092c20 in stdin ()
#64 0x00000000 in ?? ()
#65 0xbffff3b8 in ?? ()
#66 0x08057dcc in ?? ()
#67 0x08092c20 in stdin ()
#68 0x080965c0 in ?? ()
---Type <return> to continue, or q <return> to quit---
#69 0xbffff3c4 in ?? ()
#70 0x080723e9 in error ()
Previous frame inner to this frame (corrupt stack?)

Dimi
Hrm...

> Reading symbols from /lib/tls/libpthread.so.0...

It looks like Debian built sshd with the pthread hack, which is unsupported (and opens a whole other can of worms). Can you reproduce it with 3.9p1 with "UsePAM no" in sshd_config?
Update: only ssh -1 is broken (many command-only keys are ssh1). ssh -2 is OK.
That's interesting. Could you please attach (note: please use "create attachment", don't paste into the text field) your sshd_config. Also, could you please try running the server in debug mode ("/path/to/sshd -ddde -p 2022" then connect to port 2022) and if you can reproduce the problem with it in debug mode, please attach that log separately.
Created attachment 761 [details] My sshd_config
This problem can't be reproduced in debug mode. My sshd_config is now attached.
OK, if it can't be provoked in debug mode then the other option is to kick the debugging up to DEBUG3 and get the messages from syslog. I see that you already have the LogLevel at DEBUG, which may have some clues. Could you grep a failing session out by pid and attach that? Also, if you kill one of the "sshd: root@notty" processes does the "defunct" process vanish too? I've seen defunct processes on Linux wedge unkillably, usually when they get straced.
Created attachment 762 [details] Logs from parent
Yes, if I kill the parent process, then the child zombie dies too. Is the SIGCHLD handler buggy?
It could be a problem with the SIGCHLD handler but I have a feeling it's some kind of race. Do you see the problem running, say, "sleep 1" compared to "true" via ssh ? I'm guessing that if the command exits *really* fast then the SIGCHLD might be delivered before the handler is set up. If that's true then I would expect the sleep's to be problem free but the true's to exhibit the problem.
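The suspected race can be illustrated outside sshd. The following is a minimal sketch in C, not code from OpenSSH: the names child_terminated, sigchld_handler, set_sigchld and late_handler_sees_child are hypothetical, and it assumes Linux behaviour, where a SIGCHLD generated while the default disposition is in place is discarded. The child exits before the handler is installed, so the handler never runs and the flag stays 0 even though a zombie is waiting to be reaped.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical flag name; written only by the SIGCHLD handler. */
static volatile sig_atomic_t child_terminated = 0;

static void sigchld_handler(int sig)
{
    (void)sig;
    child_terminated = 1;
}

/* Set the SIGCHLD disposition; returns 0 on success, -1 on error. */
static int set_sigchld(void (*handler)(int))
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGCHLD, &sa, NULL);
}

/*
 * Fork a child that exits at once, wait until it really has exited,
 * and only then install the handler. Returns the flag value observed
 * afterwards (0 means the exit went unnoticed), or -1 on error.
 */
int late_handler_sees_child(void)
{
    siginfo_t info;
    pid_t pid;
    int seen;

    child_terminated = 0;
    if (set_sigchld(SIG_DFL) == -1)
        return -1;

    pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(0);                 /* child: exits as fast as "true" would */

    /* Block until the child has exited, but leave it unreaped (WNOWAIT). */
    memset(&info, 0, sizeof(info));
    if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) == -1)
        return -1;

    /* Too late: the child's SIGCHLD was generated before this point. */
    if (set_sigchld(sigchld_handler) == -1)
        return -1;
    seen = child_terminated;

    (void)waitpid(pid, NULL, 0);  /* reap the zombie */
    (void)set_sigchld(SIG_DFL);
    return seen;
}
```

With a slower command such as "sleep 1" the handler would already be in place when the child exits, which is consistent with the prediction above. Whether this is what actually happens inside sshd is only a hypothesis at this point in the thread.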
Bingo! With "sleep 1" sshd works OK. The first command with "true" makes a zombie. That is why I never got the problem with -ddd.
Created attachment 763 [details] check for pending child after setting up handler

Here's a patch to try, against 3.9p1. It will (assuming I got it right :-) check for a pending child and set the appropriate flags immediately after setting up the SIGCHLD handler. I would guess that the reason it doesn't happen with debugging on is that the debugging changes the timing enough for it to miss the window.
Created attachment 764 [details] check for pending child after setting up handler

Oops, looks like I didn't get it right after all... Try this one instead.
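The idea named in the attachment title, checking for a pending child right after the handler is set up, can be sketched as follows. This is not the attached patch, whose contents are not shown in this thread: it is a hedged illustration in C with hypothetical names (child_terminated, install_handler_and_check, early_exit_is_noticed). It assumes that a non-blocking waitid() with WNOWAIT is an acceptable way to detect an already-exited child without reaping it.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical flag name; the main loop would poll it to reap children. */
static volatile sig_atomic_t child_terminated = 0;

static void sigchld_handler(int sig)
{
    (void)sig;
    child_terminated = 1;
}

/*
 * Install the SIGCHLD handler, then immediately check whether a child
 * exited before the handler existed. Returns 0 on success, -1 on error.
 */
int install_handler_and_check(void)
{
    struct sigaction sa;
    siginfo_t info;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sigchld_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGCHLD, &sa, NULL) == -1)
        return -1;

    /* Zero si_pid first so "no child ready" can be told apart. */
    memset(&info, 0, sizeof(info));
    if (waitid(P_ALL, 0, &info, WEXITED | WNOHANG | WNOWAIT) == 0 &&
        info.si_pid != 0)
        child_terminated = 1;     /* same effect as a delivered SIGCHLD */
    return 0;
}

/* Demo: the child exits before the handler is installed. Returns the flag. */
int early_exit_is_noticed(void)
{
    struct sigaction dfl;
    siginfo_t info;
    pid_t pid;
    int seen;

    memset(&dfl, 0, sizeof(dfl));
    dfl.sa_handler = SIG_DFL;
    sigemptyset(&dfl.sa_mask);
    if (sigaction(SIGCHLD, &dfl, NULL) == -1)
        return -1;
    child_terminated = 0;

    pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(0);

    /* Wait for the exit without reaping, so the child stays a zombie. */
    memset(&info, 0, sizeof(info));
    if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) == -1)
        return -1;

    if (install_handler_and_check() == -1)
        return -1;
    seen = child_terminated;

    (void)waitpid(pid, NULL, 0);  /* reap the zombie */
    (void)sigaction(SIGCHLD, &dfl, NULL);
    return seen;
}
```

If a child exits between the sigaction() call and the waitid() check, both paths set the same flag, so the check is harmless in that case.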
Hi, I tried to patch 3.8p1 with the second patch. Same problem with /bin/true as the command. I will try 3.9p1 tomorrow, but 3.8p1 is the interesting one because it is the default in Debian stable/testing. So I think the Debian maintainer must backport your patch too.
If the patch didn't help then the problem is probably elsewhere. Could you please bump LogLevel to DEBUG3 and grep a failing session out of syslog as mentioned earlier?
Created attachment 767 [details] hangs with DEBUG3
Hi, I've patched 3.8p1 with the second patch and now get an error on the client every time: Received disconnect from 10.0.0.3: wait: Bad file descriptor
So, if I understand you correctly: with patch #764 you get that error but it doesn't hang? If so then we're probably on the right track, but my patch isn't quite right.
Could you please double-check that you're using the patch in attachment #764? I saw that bogus disconnect error with the older patch but I can't reproduce it with #764.
I will now compile a new package with patch #764 and try it again.
OK, now the problem does not occur every time, but it is not completely solved. This is OK:

host:~> time for I in `seq 1 100`; do ssh -1 root@balancedev2 true; done
real 1m19.477s
user 0m4.682s
sys 0m2.609s

But if I run the same command in parallel from another shell, then the first child on the server becomes a zombie when the second command is started. Very strange!
So far I haven't been able to reproduce this. What can you tell me about the host itself? Is it fast or slow? Single/multiple CPU?
Hi, this machine isn't very fast:

balancedev2:~# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 11
model name      : Intel(R) Pentium(R) III CPU family 1133MHz
stepping        : 1
cpu MHz         : 1130.892
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips        : 2220.03

balancedev2:~# free
             total       used       free     shared    buffers     cached
Mem:        506788     471804      34984          0     113852     277188
-/+ buffers/cache:      80764     426024
Swap:       208836          0     208836
On another server, the same issue occurs with the patch too. Only DEBUG3 solves this problem.
Hi, could you verify this issue? Because of this bug we can't switch to a 2.6 kernel. We have a lot of command-only keys :-(

Dimi
Are you patching Debian's OpenSSH, or are you building from the sources that we release?
Hi, I've patched and recompiled the Debian package.

Dimi
Please rebuild from our sources rather than the Debian package. Debian make changes, like enabling threads, that we specifically warn are dangerous and unsupported. We cannot support the Debian package.
Hi, I've recompiled sshd from the original source with:

./configure --prefix=/usr --sysconfdir=/etc/ssh --libexecdir=/usr/lib \
    --mandir=/usr/share/man --with-tcp-wrappers \
    --with-xauth=/usr/bin/X11/xauth \
    --with-default-path=/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin \
    --with-superuser-path=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/X11R6/bin \
    --with-pam --with-4in6 --with-privsep-path=/var/run/sshd \
    --without-rand-helper

The patch was added too, but it does not solve the problem with '/bin/true' as the command. I get a lot of zombies again.
Hi, it seems to happen only if ssh is compiled with '--with-pam' and with -DUSE_POSIX_THREADS (the Debian default). Without these defines, ssh 3.8p1 does not fail either.

Dimi
Hmmm... I have installed the Debian SSH package (without modification). It seems to happen only with LogLevel >= DEBUG. LogLevel <= VERBOSE is OK.

Dimi
Please turn *off* threads in your build. Like I said: they are completely unsupported.
Threads are unsupported - don't use them, we don't accept bug reports against them.
Change all RESOLVED bugs to CLOSED with the exception of the ones fixed post-4.4.