| Summary: | Command only sessions hangs on target system. | | |
|---|---|---|---|
| Product: | Portable OpenSSH | Reporter: | Dimitrij Hilt <dimitrij> |
| Component: | sshd | Assignee: | OpenSSH Bugzilla mailing list <openssh-bugs> |
| Status: | CLOSED INVALID | | |
| Severity: | normal | | |
| Priority: | P2 | | |
| Version: | 3.9p1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
Description
Dimitrij Hilt
2004-12-29 20:27:53 AEDT
Is this built with PAM? Does the problem occur with 3.9p1? There was a bug that was fixed in 3.9p1 relating to the handling of SIGCHLD in the PAM code which could possibly be the cause of this.

It is the standard Debian sarge (testing) sshd and was built with PAM. It is 3.4p1.

3.4p1 is 2.5 years old. Please try to reproduce this problem with 3.9p1, or you can take the bug up with your OS vendor if they refuse to provide a non-ancient version.

Why does 'UsePAM no' in sshd_config not solve this problem?

The UsePAM sshd_config(5) directive was introduced in 3.7p1. Prior to that, it was a compile-time directive only (configure --with-pam). That said, the bug I was referring to was introduced around 3.8ish, so it can't be the cause of your problem. Can you reproduce the problem with 3.9p1? If not, please close this bug and report it to Debian.

This problem still exists with 3.9p1-1 from debian/experimental too.

strace:
Process 11107 attached - interrupt to quit
futex(0xb7e573cc, FUTEX_WAIT, 2, NULL^X <unfinished ...>

gdb:
balancedev2:~# gdb -p 11107
GNU gdb 6.3-debian
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-linux".
Attaching to process 11107
Using host libthread_db library "/lib/tls/libthread_db.so.1".
warning: could not load vsyscall page because no executable was specified
warning: try using the "file" command first
Reading symbols from /usr/sbin/sshd...(no debugging symbols found)...done.
Reading symbols from /lib/libwrap.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib/libwrap.so.0
Reading symbols from /lib/libpam.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib/libpam.so.0
Reading symbols from /lib/tls/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libdl.so.2
Reading symbols from /lib/tls/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libresolv.so.2
Reading symbols from /usr/lib/i686/cmov/libcrypto.so.0.9.7...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/i686/cmov/libcrypto.so.0.9.7
Reading symbols from /lib/tls/libutil.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libutil.so.1
Reading symbols from /usr/lib/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /lib/tls/libnsl.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libnsl.so.1
Reading symbols from /lib/tls/libcrypt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libcrypt.so.1
Reading symbols from /lib/tls/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
[New Thread -1210984832 (LWP 11107)]
Loaded symbols for /lib/tls/libpthread.so.0
Reading symbols from /lib/tls/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libc.so.6
Reading symbols from /lib/ld-linux.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/tls/libnss_compat.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libnss_compat.so.2
Reading symbols from /lib/tls/libnss_nis.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libnss_nis.so.2
Reading symbols from /lib/tls/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libnss_files.so.2
Reading symbols from /lib/tls/libnss_dns.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libnss_dns.so.2
0xb7e077a1 in pthread_setcanceltype () from /lib/tls/libc.so.6
(gdb) bt
#0 0xb7e077a1 in pthread_setcanceltype () from /lib/tls/libc.so.6
#1 0x00000001 in ?? ()
#2 0xb7e55fcc in ?? () from /lib/tls/libc.so.6
#3 0x00000001 in ?? ()
#4 0xb7df7523 in setlogmask () from /lib/tls/libc.so.6
#5 0xbfffdd74 in ?? ()
#6 0xb7df74a0 in setlogmask () from /lib/tls/libc.so.6
#7 0x00000000 in ?? ()
#8 0x00000001 in ?? ()
#9 0x00000000 in ?? ()
#10 0xbfffe174 in ?? ()
#11 0xbfffdd74 in ?? ()
#12 0x00000007 in ?? ()
#13 0xbfffe58c in ?? ()
#14 0x0806f3ac in error ()
#15 0x0806f150 in error ()
#16 0x08053301 in ?? ()
#17 0x0807ff45 in _IO_stdin_used ()
#18 0xb7e46d5b in in6addr_loopback () from /lib/tls/libc.so.6
#19 0xbfffea6c in ?? ()
#20 0xbfffe63c in ?? ()
#21 0x00000009 in ?? ()
#22 0x00000000 in ?? ()
---Type <return> to continue, or q <return> to quit---
#23 0x00000008 in ?? ()
#24 <signal handler called>
#25 0xb7dfba1c in send () from /lib/tls/libc.so.6
#26 0xb7df6e10 in vsyslog () from /lib/tls/libc.so.6
#27 0xb7df6aaf in syslog () from /lib/tls/libc.so.6
#28 0x0806f3c1 in error ()
#29 0x0806f150 in error ()
#30 0x08054e27 in ?? ()
#31 0x080800d4 in _IO_stdin_used ()
#32 0x00000005 in ?? ()
#33 0xbffff358 in ?? ()
#34 0x08053d5f in ?? ()
#35 0x0808f334 in stdin ()
#36 0x080532e0 in ?? ()
#37 0x00000000 in ?? ()
#38 0x00000000 in ?? ()
#39 0x08092c20 in stdin ()
#40 0x00000000 in ?? ()
#41 0xbffff358 in ?? ()
#42 0x00000000 in ?? ()
#43 0x00000000 in ?? ()
#44 0x00000000 in ?? ()
#45 0x00000000 in ?? ()
---Type <return> to continue, or q <return> to quit---
#46 0x00002b65 in ?? ()
#47 0x08092c20 in stdin ()
#48 0xbffff380 in ?? ()
#49 0xbffff398 in ?? ()
#50 0x0805793a in ?? ()
#51 0x00002b65 in ?? ()
#52 0x00000005 in ?? ()
#53 0x00000005 in ?? ()
#54 0x00000007 in ?? ()
#55 0x080965c0 in ?? ()
#56 0x0809bc62 in ?? ()
#57 0x00000006 in ?? ()
#58 0x00000007 in ?? ()
#59 0x00000004 in ?? ()
#60 0x00000005 in ?? ()
#61 0xbffff3a8 in ?? ()
#62 0x080965c0 in ?? ()
#63 0x08092c20 in stdin ()
#64 0x00000000 in ?? ()
#65 0xbffff3b8 in ?? ()
#66 0x08057dcc in ?? ()
#67 0x08092c20 in stdin ()
#68 0x080965c0 in ?? ()
---Type <return> to continue, or q <return> to quit---
#69 0xbffff3c4 in ?? ()
#70 0x080723e9 in error ()
Previous frame inner to this frame (corrupt stack?)

Dimi

Hrm...
> Reading symbols from /lib/tls/libpthread.so.0...
It looks like Debian built sshd with the pthread hack, which is unsupported (and
opens a whole other can of worms).
Can you reproduce it with 3.9p1 with "UsePAM no" in sshd_config?
Update: only ssh -1 is broken (many command-only keys are ssh1); ssh -2 is OK.

That's interesting. Could you please attach (note: please use "create
attachment", don't paste into the text field) your sshd_config.
Also, could you please try running the server in debug mode ("/path/to/sshd
-ddde -p 2022" then connect to port 2022) and if you can reproduce the problem
with it in debug mode, please attach that log separately.
Created attachment 761 [details]
My sshd_config
This problem can't be reproduced in debug mode. My sshd_config is now attached.

OK, if it can't be provoked in debug mode then the other option is to kick the debugging up to DEBUG3 and get the messages from syslog. I see that you already have the LogLevel at DEBUG, which may have some clues. Could you grep a failing session out by pid and attach that? Also, if you kill one of the "sshd: root@notty" processes does the "defunct" process vanish too? I've seen defunct processes on Linux wedge unkillably, usually when they get straced.

Created attachment 762 [details]
Logs from parent
Yes, if I kill the parent process, then the child zombie dies too. Is the SIGCHLD handler buggy?

It could be a problem with the SIGCHLD handler but I have a feeling it's some kind of race. Do you see the problem running, say, "sleep 1" compared to "true" via ssh? I'm guessing that if the command exits *really* fast then the SIGCHLD might be delivered before the handler is set up. If that's true then I would expect the sleeps to be problem free but the trues to exhibit the problem.

Bingo! With "sleep 1" sshd works OK. The first command with true makes a zombie. That is why I never saw the problem with -ddd.

Created attachment 763 [details]
check for pending child after setting up handler
Here's a patch to try, against 3.9p1. It will (assuming I got it right :-)
check for a pending child and set the appropriate flags immediately after
setting up the SIGCHLD handler.
I would guess that the reason it doesn't happen with debugging on is that the
debugging changes the timing enough for it to miss the window.
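
The approach described in this comment (install the SIGCHLD handler, then immediately check for a child that has already exited) can be sketched as follows. This is not the code from attachment 763 or 764, and it is written in Python rather than sshd's C purely for illustration. The names `child_exited`, `on_sigchld`, `install_handler_and_check_pending`, and `run_fast_command` are hypothetical, and the sketch assumes a Linux host where `os.waitid` is available.

```python
import os
import signal

child_exited = False  # flag the main loop would poll (hypothetical name)


def on_sigchld(signum, frame):
    """SIGCHLD handler: only record that some child terminated."""
    global child_exited
    child_exited = True


def install_handler_and_check_pending(pid=-1):
    """Install the handler, then poll once for a child that already exited.

    Returns (pid, status) if a child was pending, otherwise None.
    """
    global child_exited
    signal.signal(signal.SIGCHLD, on_sigchld)
    try:
        done_pid, status = os.waitpid(pid, os.WNOHANG)
    except ChildProcessError:
        return None  # no such child to wait for
    if done_pid == 0:
        return None  # child still running; the handler will catch its exit
    child_exited = True  # the exit happened before the handler existed
    return done_pid, status


def run_fast_command():
    """Simulate a command that exits before the handler is installed."""
    global child_exited
    child_exited = False
    signal.signal(signal.SIGCHLD, signal.SIG_DFL)
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: behaves like `true`
    # Block until the child has exited but leave it reapable (WNOWAIT),
    # so the exit is guaranteed to precede handler installation.
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    missed = not child_exited  # True: nothing has noticed the exit yet
    pending = install_handler_and_check_pending(pid)
    return pid, missed, pending, child_exited
```

Calling `run_fast_command()` forces the ordering suspected in this report: the child is gone before any handler exists, so only the explicit `waitpid` poll notices it. As the later comments show, closing this window was not enough to stop the hangs the reporter was seeing.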
Created attachment 764 [details]
check for pending child after setting up handler
Oops, looks like I didn't get it right after all... Try this one instead.
Hi, I tried to patch 3.8p1 with the second patch. Same problem with /bin/true as the command. I will try 3.9p1 tomorrow, but 3.8p1 is the interesting one because it is the default in Debian stable/testing. So I think the Debian maintainer must backport your patch too.

If the patch didn't help then the problem is probably elsewhere. Could you please bump LogLevel to DEBUG3 and grep a failing session out of syslog as mentioned earlier?

Created attachment 767 [details]
hangs with DEBUG3
Hi, I've patched 3.8p1 with the second patch and now get an error on the client every time:
Received disconnect from 10.0.0.3: wait: Bad file descriptor

So, if I understand you: with patch #764 you get that error but it doesn't hang? If so then we're probably on the right track, but my patch isn't quite right. Please double-check that you're using the patch in attachment #764 [details]. I saw that
bogus disconnect error with the older patch but I can't reproduce it with #764.
I will now compile a new package with patch #764 and try it again.

OK, now the problem does not happen every time, but it is not completely solved. This is OK:

host:~> time for I in `seq 1 100`; do ssh -1 root@balancedev2 true; done
real 1m19.477s
user 0m4.682s
sys 0m2.609s

But if I run the same command from another shell in parallel, then the first child on the server becomes a zombie when the second command is started. Very strange!

So far I haven't been able to reproduce this. What can you tell me about the host itself? Is it fast or slow? Single/multiple CPU?

Hi,
this machine isn't very fast:
balancedev2:~# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 11
model name : Intel(R) Pentium(R) III CPU family 1133MHz
stepping : 1
cpu MHz : 1130.892
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 mmx fxsr sse
bogomips : 2220.03
balancedev2:~# free
total used free shared buffers cached
Mem: 506788 471804 34984 0 113852 277188
-/+ buffers/cache: 80764 426024
Swap: 208836 0 208836
On another server, the same issue occurs with the patch too. Only DEBUG3 solves this problem.

Hi, could you verify this issue? Because of this bug we can't switch to a 2.6 kernel. We have a lot of command-only keys :-( Dimi

Are you patching Debian's OpenSSH, or are you building from the sources that we release?

Hi, I've patched and recompiled the Debian package. Dimi

Please rebuild from our sources rather than the Debian package. Debian make changes, like enabling threads, that we specifically warn are dangerous and unsupported. We cannot support the Debian package.

Hi, I've recompiled sshd from the original source with:

./configure --prefix=/usr --sysconfdir=/etc/ssh --libexecdir=/usr/lib --mandir=/usr/share/man --with-tcp-wrappers --with-xauth=/usr/bin/X11/xauth --with-default-path=/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin --with-superuser-path=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/X11R6/bin --with-pam --with-4in6 --with-privsep-path=/var/run/sshd --without-rand-helper

The patch was added too, but it does not solve the problem with '/bin/true' as the command. I got a lot of zombies again.

Hi, it seems to happen only if ssh is compiled with '--with-pam' and with -DUSE_POSIX_THREADS (the Debian default). Without these defines, ssh 3.8p1 does not fail either. Dimi

Hmmm... I have installed the Debian ssh (without modification). It seems to happen only with LogLevel >= DEBUG. LogLevel <= VERBOSE is OK. Dimi

Please turn *off* threads in your build. Like I said: they are completely unsupported.

Threads are unsupported - don't use them, we don't accept bug reports against them.

Change all RESOLVED bugs to CLOSED with the exception of the ones fixed post-4.4.