Bug 967 - Command only sessions hangs on target system.
Summary: Command only sessions hangs on target system.
Status: CLOSED INVALID
Alias: None
Product: Portable OpenSSH
Classification: Unclassified
Component: sshd
Version: 3.9p1
Hardware: All Linux
Importance: P2 normal
Assignee: OpenSSH Bugzilla mailing list
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2004-12-29 20:27 AEDT by Dimitrij Hilt
Modified: 2006-10-07 11:38 AEST

Attachments
My sshd_config (1.69 KB, text/plain)
2005-01-10 21:03 AEDT, Dimitrij Hilt
Logs from parent (1.93 KB, text/plain)
2005-01-10 22:07 AEDT, Dimitrij Hilt
check for pending child after setting up handler (1.07 KB, patch)
2005-01-10 23:26 AEDT, Darren Tucker
check for pending child after setting up handler (1.16 KB, patch)
2005-01-10 23:47 AEDT, Darren Tucker
hangs with DEBUG3 (7.60 KB, text/plain)
2005-01-11 18:36 AEDT, Dimitrij Hilt

Description Dimitrij Hilt 2004-12-29 20:27:53 AEDT
On Debian Linux (sarge) with kernel 2.6.9, a non-privileged thread of sshd
hangs when the executed command returns. Not every request hangs, but many do:
 4683 ?        Ss     0:00 /usr/sbin/sshd
 6295 ?        Ss     0:00  \_ sshd: root@notty
 6297 ?        Zs     0:00  |   \_ [check_dpt] <defunct>
 8000 ?        Ss     0:00  \_ sshd: root@notty
 8002 ?        Zs     0:00  |   \_ [check_dpt] <defunct>
 8048 ?        Ss     0:00  \_ sshd: root@notty
 8050 ?        Zs     0:00  |   \_ [check_dpt] <defunct>
 8063 ?        Ss     0:00  \_ sshd: root@notty
 8065 ?        Zs     0:00  |   \_ [check_dpt] <defunct>
 8078 ?        Ss     0:00  \_ sshd: root@notty
 8080 ?        Zs     0:00  |   \_ [check_dpt] <defunct>
 8098 ?        Ss     0:00  \_ sshd: root@notty
 8100 ?        Zs     0:00      \_ [check_dpt] <defunct>

Dimitrij
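
For anyone who wants to count the stuck children automatically, here is a minimal sketch in Python that picks the zombie entries out of a listing shaped like the one above. The names `find_zombies`, `PS_LINE` and `SAMPLE` are illustrative only, and the regular expression assumes the five-column PID/TTY/STAT/TIME/COMMAND layout seen in the report.

```python
import re

# Assumed layout: PID, TTY, STAT, TIME, then the command (possibly with tree art).
PS_LINE = re.compile(r"^\s*(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(.*)$")


def find_zombies(ps_output):
    """Return (pid, command) pairs for every process whose STAT starts with 'Z'."""
    zombies = []
    for line in ps_output.splitlines():
        match = PS_LINE.match(line)
        if not match:
            continue  # blank lines, headers, anything that is not a process row
        pid, _tty, stat, _time, command = match.groups()
        if stat.startswith("Z"):
            # Drop the tree-drawing prefix such as "|   \_ " before the command.
            command = re.sub(r"^[|\s]*\\_\s*", "", command)
            zombies.append((int(pid), command))
    return zombies


# A few lines copied from the report above, used here as sample input.
SAMPLE = r"""
 4683 ?        Ss     0:00 /usr/sbin/sshd
 6295 ?        Ss     0:00  \_ sshd: root@notty
 6297 ?        Zs     0:00  |   \_ [check_dpt] <defunct>
"""

if __name__ == "__main__":
    for pid, command in find_zombies(SAMPLE):
        print(pid, command)
```

Feeding it the full listing from the report should yield one entry per `<defunct>` row.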
Comment 1 Darren Tucker 2005-01-06 10:22:52 AEDT
Is this built with PAM?  Does the problem occur with 3.9p1?

There was a bug that was fixed in 3.9p1 relating to the handling of SIGCHLD in
the PAM code which could possibly be the cause of this.
Comment 2 Dimitrij Hilt 2005-01-10 18:43:23 AEDT
It is the standard Debian sarge (testing) sshd and was built with PAM. It is 3.4p1.
Comment 3 Damien Miller 2005-01-10 18:48:03 AEDT
3.4p1 is 2.5 years old. Please try to reproduce this problem with 3.9p1, or you
can take the bug up with your OS vendor if they refuse to provide a non-ancient
version.
Comment 4 Dimitrij Hilt 2005-01-10 18:51:37 AEDT
Why does 'UsePAM no' in sshd_config not solve this problem?
Comment 5 Darren Tucker 2005-01-10 19:12:51 AEDT
The UsePAM sshd_config(5) directive was introduced in 3.7p1.  Prior to that, it
was a compile-time directive only (configure --with-pam).

That said, the bug I was referring to was introduced around 3.8ish, so it can't
be the cause of your problem.

Can you reproduce the problem with 3.9p1?  If not, please close this bug and
report it to Debian.
Comment 6 Dimitrij Hilt 2005-01-10 19:54:55 AEDT
This problem still exists with 3.9p1-1 from debian/experimental too.

strace:
Process 11107 attached - interrupt to quit
futex(0xb7e573cc, FUTEX_WAIT, 2, NULL^X <unfinished ...>

gdb:
balancedev2:~# gdb -p 11107
GNU gdb 6.3-debian
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-linux".
Attaching to process 11107
Using host libthread_db library "/lib/tls/libthread_db.so.1".

warning: could not load vsyscall page because no executable was specified

warning: try using the "file" command first
Reading symbols from /usr/sbin/sshd...(no debugging symbols found)...done.
Reading symbols from /lib/libwrap.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib/libwrap.so.0
Reading symbols from /lib/libpam.so.0...
(no debugging symbols found)...done.
Loaded symbols for /lib/libpam.so.0
Reading symbols from /lib/tls/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libdl.so.2
Reading symbols from /lib/tls/libresolv.so.2...
(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libresolv.so.2
Reading symbols from /usr/lib/i686/cmov/libcrypto.so.0.9.7...(no debugging
symbols found)...done.
Loaded symbols for /usr/lib/i686/cmov/libcrypto.so.0.9.7
Reading symbols from /lib/tls/libutil.so.1...
(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libutil.so.1
Reading symbols from /usr/lib/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /lib/tls/libnsl.so.1...
(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libnsl.so.1
Reading symbols from /lib/tls/libcrypt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libcrypt.so.1
Reading symbols from /lib/tls/libpthread.so.0...
(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
[New Thread -1210984832 (LWP 11107)]
Loaded symbols for /lib/tls/libpthread.so.0
Reading symbols from /lib/tls/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libc.so.6
Reading symbols from /lib/ld-linux.so.2...
(no debugging symbols found)...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/tls/libnss_compat.so.2...(no debugging symbols
found)...done.
Loaded symbols for /lib/tls/libnss_compat.so.2
Reading symbols from /lib/tls/libnss_nis.so.2...
(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libnss_nis.so.2
Reading symbols from /lib/tls/libnss_files.so.2...(no debugging symbols
found)...done.
Loaded symbols for /lib/tls/libnss_files.so.2
Reading symbols from /lib/tls/libnss_dns.so.2...
(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libnss_dns.so.2
0xb7e077a1 in pthread_setcanceltype ()
   from /lib/tls/libc.so.6
(gdb) bt
#0  0xb7e077a1 in pthread_setcanceltype () from /lib/tls/libc.so.6
#1  0x00000001 in ?? ()
#2  0xb7e55fcc in ?? () from /lib/tls/libc.so.6
#3  0x00000001 in ?? ()
#4  0xb7df7523 in setlogmask () from /lib/tls/libc.so.6
#5  0xbfffdd74 in ?? ()
#6  0xb7df74a0 in setlogmask () from /lib/tls/libc.so.6
#7  0x00000000 in ?? ()
#8  0x00000001 in ?? ()
#9  0x00000000 in ?? ()
#10 0xbfffe174 in ?? ()
#11 0xbfffdd74 in ?? ()
#12 0x00000007 in ?? ()
#13 0xbfffe58c in ?? ()
#14 0x0806f3ac in error ()
#15 0x0806f150 in error ()
#16 0x08053301 in ?? ()
#17 0x0807ff45 in _IO_stdin_used ()
#18 0xb7e46d5b in in6addr_loopback () from /lib/tls/libc.so.6
#19 0xbfffea6c in ?? ()
#20 0xbfffe63c in ?? ()
#21 0x00000009 in ?? ()
#22 0x00000000 in ?? ()
---Type <return> to continue, or q <return> to quit---
#23 0x00000008 in ?? ()
#24 <signal handler called>
#25 0xb7dfba1c in send () from /lib/tls/libc.so.6
#26 0xb7df6e10 in vsyslog () from /lib/tls/libc.so.6
#27 0xb7df6aaf in syslog () from /lib/tls/libc.so.6
#28 0x0806f3c1 in error ()
#29 0x0806f150 in error ()
#30 0x08054e27 in ?? ()
#31 0x080800d4 in _IO_stdin_used ()
#32 0x00000005 in ?? ()
#33 0xbffff358 in ?? ()
#34 0x08053d5f in ?? ()
#35 0x0808f334 in stdin ()
#36 0x080532e0 in ?? ()
#37 0x00000000 in ?? ()
#38 0x00000000 in ?? ()
#39 0x08092c20 in stdin ()
#40 0x00000000 in ?? ()
#41 0xbffff358 in ?? ()
#42 0x00000000 in ?? ()
#43 0x00000000 in ?? ()
#44 0x00000000 in ?? ()
#45 0x00000000 in ?? ()
---Type <return> to continue, or q <return> to quit---
#46 0x00002b65 in ?? ()
#47 0x08092c20 in stdin ()
#48 0xbffff380 in ?? ()
#49 0xbffff398 in ?? ()
#50 0x0805793a in ?? ()
#51 0x00002b65 in ?? ()
#52 0x00000005 in ?? ()
#53 0x00000005 in ?? ()
#54 0x00000007 in ?? ()
#55 0x080965c0 in ?? ()
#56 0x0809bc62 in ?? ()
#57 0x00000006 in ?? ()
#58 0x00000007 in ?? ()
#59 0x00000004 in ?? ()
#60 0x00000005 in ?? ()
#61 0xbffff3a8 in ?? ()
#62 0x080965c0 in ?? ()
#63 0x08092c20 in stdin ()
#64 0x00000000 in ?? ()
#65 0xbffff3b8 in ?? ()
#66 0x08057dcc in ?? ()
#67 0x08092c20 in stdin ()
#68 0x080965c0 in ?? ()
---Type <return> to continue, or q <return> to quit---
#69 0xbffff3c4 in ?? ()
#70 0x080723e9 in error ()
Previous frame inner to this frame (corrupt stack?)

Dimi
Comment 7 Darren Tucker 2005-01-10 20:07:25 AEDT
Hrm...
> Reading symbols from /lib/tls/libpthread.so.0...

It looks like Debian built sshd with the pthread hack, which is unsupported (and
opens a whole other can of worms).

Can you reproduce it with 3.9p1 with "UsePAM no" in sshd_config?
Comment 8 Dimitrij Hilt 2005-01-10 20:41:48 AEDT
Update:

Only ssh -1 is broken (many command-only keys are ssh1); ssh -2 is OK.
Comment 9 Darren Tucker 2005-01-10 20:53:40 AEDT
That's interesting.  Could you please attach (note: please use "create
attachment", don't paste into the text field) your sshd_config.

Also, could you please try running the server in debug mode ("/path/to/sshd
-ddde -p 2022" then connect to port 2022) and if you can reproduce the problem
with it in debug mode, please attach that log separately.
Comment 10 Dimitrij Hilt 2005-01-10 21:03:51 AEDT
Created attachment 761
My sshd_config
Comment 11 Dimitrij Hilt 2005-01-10 21:04:43 AEDT
This problem can't be reproduced in debug mode. My sshd_config is now attached.
Comment 12 Darren Tucker 2005-01-10 21:57:57 AEDT
OK, if it can't be provoked in debug mode then the other option is to kick the
debugging up to DEBUG3 and get the messages from syslog.

I see that you already have the LogLevel at DEBUG, which may have some clues. 
Could you grep a failing session out by pid and attach that?

Also, if you kill one of the "sshd: root@notty" processes does the "defunct"
process vanish too?  I've seen defunct processes on Linux wedge unkillably,
usually when they get straced.
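
Pulling one session out of syslog by pid can be sketched as follows. This is a hypothetical Python helper, not an OpenSSH tool, and it assumes each syslog line carries the usual `program[pid]:` tag. The sample lines are made up for illustration and are not taken from the reporter's logs.

```python
import re


def lines_for_pid(log_lines, pid, program="sshd"):
    """Return only the log lines tagged 'program[pid]:'."""
    tag = re.compile(r"\b%s\[%d\]:" % (re.escape(program), pid))
    return [line for line in log_lines if tag.search(line)]


# Made-up sample lines in a typical syslog shape (assumption, not real output).
SAMPLE_LOG = [
    "Jan 10 22:00:01 host sshd[8000]: example message one",
    "Jan 10 22:00:01 host sshd[8048]: example message two",
    "Jan 10 22:00:02 host sshd[8000]: example message three",
    "Jan 10 22:00:02 host cron[8000]: unrelated program with the same pid",
]

if __name__ == "__main__":
    for line in lines_for_pid(SAMPLE_LOG, 8000):
        print(line)
```

Matching on the program name as well as the pid keeps out lines from other daemons that happen to reuse the same pid.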
Comment 13 Dimitrij Hilt 2005-01-10 22:07:30 AEDT
Created attachment 762
Logs from parent
Comment 14 Dimitrij Hilt 2005-01-10 22:09:07 AEDT
Yes, if I kill the parent process, then the child zombie dies too. Is the
SIGCHLD handler buggy?
Comment 15 Darren Tucker 2005-01-10 22:30:50 AEDT
It could be a problem with the SIGCHLD handler but I have a feeling it's some
kind of race.

Do you see the problem running, say, "sleep 1" compared to "true" via ssh ?

I'm guessing that if the command exits *really* fast then the SIGCHLD might be
delivered before the handler is set up.  If that's true then I would expect the
sleep's to be problem free but the true's to exhibit the problem.
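
The suspected race can be illustrated outside sshd. What follows is a minimal sketch in Python, not sshd code: it only models the hypothesis in this comment, and the function name `lost_sigchld_demo` is made up. It forks a child that exits at once, installs a SIGCHLD handler only after the child is already dead, and shows that the handler never runs even though the child is sitting there waiting to be reaped.

```python
import os
import signal


def lost_sigchld_demo():
    """Return (handler_fired, child_found_by_polling) for a too-late handler."""
    fired = []
    previous = signal.getsignal(signal.SIGCHLD)
    signal.signal(signal.SIGCHLD, signal.SIG_DFL)  # start with no handler

    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: finish immediately, like running "true"

    # Block until the child has exited, but leave it unreaped (a zombie).
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)

    # Only now install the handler. The SIGCHLD for this child was raised
    # while no handler existed, so nothing will call the handler for it.
    signal.signal(signal.SIGCHLD, lambda signum, frame: fired.append(signum))

    # A non-blocking wait still finds the finished child.
    reaped_pid, _status = os.waitpid(pid, os.WNOHANG)

    if previous is not None:
        signal.signal(signal.SIGCHLD, previous)  # restore the earlier handler
    return bool(fired), reaped_pid == pid


if __name__ == "__main__":
    print(lost_sigchld_demo())
```

A program that relies only on the handler to learn about the exit would wait forever in this situation, which matches the symptom of a parent lingering above a defunct child.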
Comment 16 Dimitrij Hilt 2005-01-10 23:07:45 AEDT
Bingo! With "sleep 1" sshd works OK. The first command with "true" makes a zombie.
That is why I never saw the problem with -ddd.
Comment 17 Darren Tucker 2005-01-10 23:26:31 AEDT
Created attachment 763
check for pending child after setting up handler

Here's a patch to try, against 3.9p1.  It will (assuming I got it right :-)
check for a pending child and set the appropriate flags immediately after
setting up the SIGCHLD handler.

I would guess that the reason it doesn't happen with debugging on is that the
debugging changes the timing enough for it to miss the window.
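
The idea described here (install the handler, then immediately check for a child that has already exited) can be sketched like this. This is not the attached patch, which is C against 3.9p1; it is a Python illustration under assumptions, and `state`, `on_sigchld`, `install_handler_then_check` and `demo` are hypothetical names.

```python
import os
import signal

# Hypothetical flags that a main loop would poll.
state = {"child_exited": False, "exit_code": None}


def on_sigchld(signum, frame):
    # Keep the handler tiny: only record that some child changed state.
    state["child_exited"] = True


def install_handler_then_check(pid):
    """Install the SIGCHLD handler, then poll once for an already-dead child."""
    signal.signal(signal.SIGCHLD, on_sigchld)
    reaped_pid, status = os.waitpid(pid, os.WNOHANG)  # (0, 0) if still running
    if reaped_pid == pid:
        state["child_exited"] = True
        if os.WIFEXITED(status):
            state["exit_code"] = os.WEXITSTATUS(status)
    return state["child_exited"]


def demo():
    """Fork a child that dies before the handler exists, then run the check."""
    previous = signal.getsignal(signal.SIGCHLD)
    signal.signal(signal.SIGCHLD, signal.SIG_DFL)
    state.update(child_exited=False, exit_code=None)

    pid = os.fork()
    if pid == 0:
        os._exit(3)  # child: exit at once with a recognisable code

    # Wait until the child is a zombie without reaping it.
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)

    found = install_handler_then_check(pid)
    if previous is not None:
        signal.signal(signal.SIGCHLD, previous)  # restore the earlier handler
    return found, state["exit_code"]


if __name__ == "__main__":
    print(demo())
```

If the child exits after the poll instead, the handler sets the same flag, so both orderings end with `child_exited` set.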
Comment 18 Darren Tucker 2005-01-10 23:47:23 AEDT
Created attachment 764
check for pending child after setting up handler

Oops, looks like I didn't get it right after all...  Try this one instead.
Comment 19 Dimitrij Hilt 2005-01-11 06:39:27 AEDT
Hi,

I tried to patch 3.8p1 with the second patch. Same problem with /bin/true as the command.

I will try 3.9p1 tomorrow, but 3.8p1 is the interesting case because it is the
default in Debian stable/testing. So I think the Debian maintainer must backport your patch too.
Comment 20 Darren Tucker 2005-01-11 09:39:46 AEDT
If the patch didn't help then the problem is probably elsewhere.  Could you
please bump LogLevel to DEBUG3 and grep a failing session out of syslog as
mentioned earlier?
Comment 21 Dimitrij Hilt 2005-01-11 18:36:05 AEDT
Created attachment 767
hangs with DEBUG3
Comment 22 Dimitrij Hilt 2005-01-11 18:37:18 AEDT
Hi,

I've patched 3.8p1 with the second patch and now get an error on the client every time:
Received disconnect from 10.0.0.3: wait: Bad file descriptor
Comment 23 Darren Tucker 2005-01-11 19:00:31 AEDT
So I understand you: with patch #764 you get that error but it doesn't hang?  If
so then we're probably on the right track, but my patch isn't quite right.
Comment 24 Darren Tucker 2005-01-11 19:13:40 AEDT
Please double-check that you're using the patch in attachment #764.  I saw that
bogus disconnect error with the older patch but I can't reproduce it with #764.
Comment 25 Dimitrij Hilt 2005-01-11 20:13:26 AEDT
I will now compile new package with Patch #764 and try it again.
Comment 26 Dimitrij Hilt 2005-01-11 20:25:49 AEDT
Ok,

The problem no longer happens every time, but it is not completely solved.

This is OK:

host:~> time for I in `seq 1 100`; do ssh -1 root@balancedev2 true; done

real    1m19.477s
user    0m4.682s
sys     0m2.609s


But if I run the same command in parallel from another shell, the first child on
the server becomes a zombie when the second command is started.

Very strange!
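
Running the loop from two shells at once can also be scripted. Below is a minimal sketch in Python, assuming nothing beyond the standard library; `run_loop` and `run_in_parallel` are made-up helper names. Each worker repeats the same command in sequence, and several workers run side by side, which mimics the two parallel shells described above.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_loop(command, runs):
    """Run the command `runs` times in a row and collect the exit codes."""
    codes = []
    for _ in range(runs):
        result = subprocess.run(
            command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        codes.append(result.returncode)
    return codes


def run_in_parallel(command, workers=2, runs=100):
    """Start `workers` loops at the same time; return one list of codes per worker."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_loop, command, runs) for _ in range(workers)]
        return [future.result() for future in futures]


if __name__ == "__main__":
    # Stand-in command; the thread used: ssh -1 root@balancedev2 true
    print(run_in_parallel(["true"], workers=2, runs=5))
```

To reproduce the case from this comment, the command would be the ssh invocation from the loop above, passed as a list such as `["ssh", "-1", "root@balancedev2", "true"]`, with the zombie check done on the server side.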

Comment 27 Darren Tucker 2005-01-19 17:06:54 AEDT
So far I haven't been able to reproduce this.  What can you tell me about the
host itself?  Is it fast or slow?  Single/multiple CPU?
Comment 28 Dimitrij Hilt 2005-01-20 02:54:01 AEDT
Hi,

this machine isn't very fast:

balancedev2:~# cat /proc/cpuinfo 
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 11
model name      : Intel(R) Pentium(R) III CPU family      1133MHz
stepping        : 1
cpu MHz         : 1130.892
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 mmx fxsr sse
bogomips        : 2220.03

balancedev2:~# free
             total       used       free     shared    buffers     cached
Mem:        506788     471804      34984          0     113852     277188
-/+ buffers/cache:      80764     426024
Swap:       208836          0     208836
Comment 29 Dimitrij Hilt 2005-01-22 09:39:04 AEDT
The same issue occurs on another server, with the patch too. Only DEBUG3 solves this problem.
Comment 30 Dimitrij Hilt 2005-02-16 10:33:28 AEDT
Hi,

Could you verify this issue? Because of this bug we can't switch to a 2.6 kernel.
We have a lot of command-only keys :-(

Dimi
Comment 31 Damien Miller 2005-02-16 11:13:44 AEDT
Are you patching Debian's OpenSSH, or are you building from the sources that we
release?
Comment 32 Dimitrij Hilt 2005-02-17 02:33:35 AEDT
Hi,

I've patched and recompiled the Debian package.

Dimi
Comment 33 Damien Miller 2005-02-17 07:55:15 AEDT
Please rebuild from our sources rather than the Debian package. Debian make
changes, like enabling threads, that we specifically warn are dangerous and
unsupported.

We cannot support the Debian package.
Comment 34 Dimitrij Hilt 2005-02-18 10:14:16 AEDT
Hi,

I've recompiled sshd from the original source with:
./configure --prefix=/usr --sysconfdir=/etc/ssh --libexecdir=/usr/lib
--mandir=/usr/share/man --with-tcp-wrappers --with-xauth=/usr/bin/X11/xauth
--with-default-path=/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin
--with-superuser-path=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/X11R6/bin
--with-pam --with-4in6 --with-privsep-path=/var/run/sshd --without-rand-helper

The patch was added too, but it does not solve the problem with '/bin/true' as the
command. I got a lot of zombies again.
Comment 35 Dimitrij Hilt 2005-03-07 21:17:24 AEDT
Hi,

It seems to happen only if ssh is compiled with '--with-pam' and with
-DUSE_POSIX_THREADS (the Debian default).

Without these defines, ssh 3.8p1 does not fail either.

Dimi
Comment 36 Dimitrij Hilt 2005-03-07 21:59:00 AEDT
Hmmm...

I have installed the Debian SSH package (without modification). It seems to happen
only with LogLevel >= DEBUG. LogLevel <= VERBOSE is OK.

Dimi
Comment 37 Damien Miller 2005-04-21 16:03:23 AEST
Please turn *off* threads in your build. Like I said: they are completely
unsupported.
Comment 38 Damien Miller 2005-06-21 13:17:02 AEST
Threads are unsupported - don't use them, we don't accept bug reports against them.
Comment 39 Darren Tucker 2006-10-07 11:38:21 AEST
Change all RESOLVED bugs to CLOSED with the exception of the ones fixed post-4.4.