Bug 1085 - Intermittent ssh core dumps
Status: CLOSED INVALID
Alias: None
Product: Portable OpenSSH
Classification: Unclassified
Component: ssh
Version: 4.2p1
Hardware: SPARC Solaris
Importance: P2 normal
Assignee: Assigned to nobody
Reported: 2005-09-13 16:39 AEST by Jeroen Scheerder
Modified: 2008-04-04 09:55 AEDT
Description Jeroen Scheerder 2005-09-13 16:39:55 AEST
I get intermittent core dumps after installing/deploying 4.2p1.  4.1p1 was (and
still is) working fine.

Here's a backtrace in gdb:

(gdb) bt
#0  0x45a94 in mkstemp64 ()
#1  0x801c4 in mkstemp64 ()
#2  0x80074 in mkstemp64 ()
#3  0x80f00 in mkstemp64 ()
#4  0x836f8 in mkstemp64 ()
#5  0x7ecec in mkstemp64 ()
#6  0x49070 in mkstemp64 ()
#7  0x48e34 in mkstemp64 ()
#8  0x36340 in _init ()
#9  0x3496c in _init ()
#10 0x31aa0 in _init ()
#11 0x31240 in _init ()
#12 0x1d508 in _init ()
#13 0x1b310 in _init ()
#14 0x13f3c in _init ()

OS is Solaris 7, running on Sparc.  OpenSSH was configured as follows:

/phil/sw/src/openssh-4.2p1/configure \
        --prefix=/phil/sw/sunos/sparc/pkg/openssh-4.2p1 \
        --sysconfdir=/phil/etc/openssh \
        --without-rsh \
        --with-pid-dir=/phil/var/run \
        --with-ssl-dir=/phil/sw/sunos/sparc/pkg/openssl-0.9.8 \
        --with-cppflags="-I/phil/sw/sunos/sparc/pkg/zlib-1.2.3/include -I/phil/src/tcpwrappers-7.6" \
        --with-ldflags="-L/phil/sw/sunos/sparc/pkg/zlib-1.2.3/lib -L/phil/src/tcpwrappers-7.6" \
        --with-default-path=/usr/bin:/bin:/phil/sw/sunos/sparc/bin \
        --with-tcp-wrappers \
        --with-skey=/phil/sw/pkg/skey-1.1.5 \
        --with-privsep-user=accessy \
        --with-privsep-path=/phil/var/prison
Comment 1 Damien Miller 2005-09-13 19:09:27 AEST
We might be able to help you if you tell us what you are doing when you get
those coredumps. Also rebuild with debugging enabled, as a debugless trace
doesn't tell us much.
Comment 2 Jeroen Scheerder 2005-09-13 19:23:56 AEST
(In reply to comment #1)
> We might be able to help you if you tell us what you are doing when you get
> those coredumps. Also rebuild with debugging enabled, as a debugless trace
> doesn't tell us much.

What I'm doing is:

$ ssh <host>

Then: core dump, about one in four tries.

I'll rebuild with debugging at earliest convenience.
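
A rough way to put a number on "about one in four" is to run the client in a loop and count the runs that die with SIGSEGV. The sketch below is only an illustration: the function name crash_rate and the host name are made up, and it assumes the shell reports a SIGSEGV death as exit status 139 (128 + signal 11), which matches the "139" seen in the regression output elsewhere in this report. ssh's own error status 255 is deliberately not counted.

```shell
# Hypothetical helper: run a command N times and count how many runs
# ended with exit status 139 (128 + SIGSEGV).
crash_rate() {
    runs=$1
    shift
    crashes=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Discard output and detach stdin so the loop never blocks on a prompt.
        "$@" >/dev/null 2>&1 </dev/null
        status=$?
        if [ "$status" -eq 139 ]; then
            crashes=$((crashes + 1))
        fi
        i=$((i + 1))
    done
    echo "$crashes/$runs"
}

# Example (somehost is a placeholder; BatchMode avoids password prompts):
# crash_rate 20 ssh -o BatchMode=yes somehost true
```

The result is printed as crashes/runs, so the same loop can be rerun after each rebuild to see whether the rate changes.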
Comment 3 Damien Miller 2005-09-19 21:48:30 AEST
Also, which compiler (and version) are you using?
Comment 4 Jeroen Scheerder 2005-09-24 03:35:31 AEST
(In reply to comment #3)
> Also, which compiler (and version) are you using?

s@goedel:pts/0(9) gcc -v                                               ~ 19:34
Using built-in specs.
Target: sparc-sun-solaris2.7
Configured with: /phil/sw/src/gcc-4.0.1/configure
--prefix=/phil/sw/sunos/sparc/pkg/gcc-4.0.1 --disable-libgcj
--enable-languages=c,c++,objc --with-gnu-as
--with-as=/phil/sw/sunos/sparc/bin/as --with-gnu-ld
--with-ld=/phil/sw/sunos/sparc/bin/ld --enable-shared
Thread model: posix
gcc version 4.0.1
Comment 5 Damien Miller 2005-09-24 07:55:48 AEST
Could you try a different compiler? gcc 4.x appears to generate broken code on
quite a few platforms, e.g. bug #1080
Comment 6 Darren Tucker 2005-09-24 10:08:28 AEST
And when you do, please make sure you recompile any of the prereqs that were
compiled with the newer compiler (esp. openssl but zlib too).

You might also want to run openssl's self-test ("make tests") after you build it.

People have also reported problems with openssl 0.9.8 but I'm not sure if those
were compiler-related or not.
Comment 7 Christian Walther 2005-09-26 20:38:06 AEST
Hi there,

I have the same problem when connecting with
OpenSSH_4.2p1, OpenSSL 0.9.8 05 Jul 2005

The compiler is gcc version 3.4.3 (csl-sol210-3_4-branch+sol_rpath). I
followed this bug and rebuilt zlib and libopenssl.
Solaris Version is 5.10 Generic_118822-02 sun4u sparc SUNW,Sun-Fire-V240. 

I executed "make tests" a couple of times, and had a number of Segfaults. The
last run produced the following output:

run test exit-status.sh ...
test remote exit status: proto 1 status 0
test remote exit status: proto 1 status 1
test remote exit status: proto 1 status 4
test remote exit status: proto 1 status 5
test remote exit status: proto 1 status 44
test remote exit status: proto 2 status 0
Write failed: Broken pipe
exit code (with sleep) mismatch for protocol 2: 255 != 0
test remote exit status: proto 2 status 1
Segmentation Fault - core dumped
exit code mismatch for protocol 2: 139 != 1
Segmentation Fault - core dumped
exit code (with sleep) mismatch for protocol 2: 139 != 1
test remote exit status: proto 2 status 4
Write failed: Broken pipe
exit code mismatch for protocol 2: 255 != 4
Segmentation Fault - core dumped
exit code (with sleep) mismatch for protocol 2: 139 != 4
test remote exit status: proto 2 status 5
test remote exit status: proto 2 status 44
failed remote exit status
make[1]: *** [t-exec] Error 1
make[1]: Leaving directory `/opt/gad/sources/openssh-4.2p1/regress'
make: *** [tests] Error 2


Running "ssh -vvv" produced the following output:
OpenSSH_4.2p1, OpenSSL 0.9.8 05 Jul 2005
debug1: Reading configuration data /etc/ssh/ssh_config
debug2: ssh_connect: needpriv 0
debug1: Connecting to gszulg01 [10.64.10.84] port 22.
debug1: Connection established.
debug1: permanently_set_uid: 0/0
debug1: identity file /.ssh/identity type -1
debug1: identity file /.ssh/id_rsa type -1
debug1: identity file /.ssh/id_dsa type -1
debug1: Remote protocol version 1.99, remote software version OpenSSH_4.1
debug1: match: OpenSSH_4.1 pat OpenSSH*
debug1: Enabling compatibility mode for protocol 2.0
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_4.2
debug2: fd 4 setting O_NONBLOCK
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug2: kex_parse_kexinit:
diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,diffie-hellman-group1-sha1
debug2: kex_parse_kexinit: ssh-rsa,ssh-dss
debug2: kex_parse_kexinit:
aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,arcfour128,arcfour256,arcfour,aes192-cbc,aes256-cbc,rijndael-cbc@lysator.liu.se,aes128-ctr,aes192-ctr,aes256-ctr
debug2: kex_parse_kexinit:
aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,arcfour128,arcfour256,arcfour,aes192-cbc,aes256-cbc,rijndael-cbc@lysator.liu.se,aes128-ctr,aes192-ctr,aes256-ctr
debug2: kex_parse_kexinit:
hmac-md5,hmac-sha1,hmac-ripemd160,hmac-ripemd160@openssh.com,hmac-sha1-96,hmac-md5-96
debug2: kex_parse_kexinit:
hmac-md5,hmac-sha1,hmac-ripemd160,hmac-ripemd160@openssh.com,hmac-sha1-96,hmac-md5-96
debug2: kex_parse_kexinit: none,zlib@openssh.com,zlib
debug2: kex_parse_kexinit: none,zlib@openssh.com,zlib
debug2: kex_parse_kexinit: 
debug2: kex_parse_kexinit: 
debug2: kex_parse_kexinit: first_kex_follows 0 
debug2: kex_parse_kexinit: reserved 0 
debug2: kex_parse_kexinit:
diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,diffie-hellman-group1-sha1
debug2: kex_parse_kexinit: ssh-rsa,ssh-dss
debug2: kex_parse_kexinit:
aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,arcfour,aes192-cbc,aes256-cbc,rijndael-cbc@lysator.liu.se,aes128-ctr,aes192-ctr,aes256-ctr
debug2: kex_parse_kexinit:
aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,arcfour,aes192-cbc,aes256-cbc,rijndael-cbc@lysator.liu.se,aes128-ctr,aes192-ctr,aes256-ctr
debug2: kex_parse_kexinit:
hmac-md5,hmac-sha1,hmac-ripemd160,hmac-ripemd160@openssh.com,hmac-sha1-96,hmac-md5-96
debug2: kex_parse_kexinit:
hmac-md5,hmac-sha1,hmac-ripemd160,hmac-ripemd160@openssh.com,hmac-sha1-96,hmac-md5-96
debug2: kex_parse_kexinit: none,zlib
debug2: kex_parse_kexinit: none,zlib
debug2: kex_parse_kexinit: 
debug2: kex_parse_kexinit: 
debug2: kex_parse_kexinit: first_kex_follows 0 
debug2: kex_parse_kexinit: reserved 0 
debug2: mac_init: found hmac-md5
debug1: kex: server->client aes128-cbc hmac-md5 none
debug2: mac_init: found hmac-md5
debug1: kex: client->server aes128-cbc hmac-md5 none
debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
Segmentation Fault (core dumped)

Backtrace is as follows:
# adb core
core file = core -- program ``/usr/bin/ssh'' on platform SUNW,Sun-Fire-V240
SIGSEGV: Segmentation Fault
$C
ffbfee20 bn_sub_words+0x3c(16b850, 16b3e0, 16b400, 7, 1, 9da20)
ffbfee90 bn_mul_recursive+0x40c(1, 20, 0, 10, 0, ffffffff)
ffbfef10 bn_mul_recursive+0x2e4(1, 40, 0, 20, 0, ffffffff)
ffbfef90 bn_mul_recursive+0x2e4(1, 80, 0, 40, 0, ffffffff)
ffbff010 BN_mul+0x2c4(159634, 16b530, 15960c, 159820, 2, 1)
ffbff088 BN_mod_mul_montgomery+0x3c(0, 1595f8, 15960c, 159858, 159820, 80)
ffbff0f8 BN_mod_exp_mont_consttime+0x56c(1595f8, 16b320, 100, d, 159820, 159858)
ffbff180 BN_mod_exp_mont+0x70(156308, 1562a8, ffbff2e0, 156288, 159820, 159858)
ffbff278 generate_key+0x94(15b7f0, 20, 1562e8, 0, 43, 149360)
ffbff308 DH_generate_key+0xc(15b7f0, 1562e8, 20, 0, c3, 0)
ffbff378 dh_gen_key+0x7c(15b7f0, 100, 1f, 7e0, ff000, ff)
ffbff3e8 kexgex_client+0x174(1586d0, 400, 916c8, 4e2fc, 2000, 1000)
ffbff488 kex_input_kexinit+0x5fc(1, 6, 1586d0, 158098, 169c10, 1586e0)
ffbff500 dispatch_run+0x94(0, 158714, 1586d0, 156248, 52ddc, 14e400)
ffbff578 ssh_kex2+0x17c(163688, 140c00, ffbff764, 15625c, 1, 0)
ffbff5e8 ssh_login+0x334(5, ffbff850, 4, 4, 1538b0, 152000)
ffbff860 main+0xce8(152064, 161ca0, 151c00, 151800, 153f48, 153400)
ffbffb20 _start+0x5c(0, 0, 0, 0, 0, 0)

Regards,
Christian

PS: Sorry for asking, but I searched the documentation, the net and even looked
at the configure script, but I didn't find a clue of how to enable debugging
during compile time. Did I miss something, and if so, could you advise on how to
enable debugging?
Comment 8 Darren Tucker 2005-09-26 21:10:12 AEST
(In reply to comment #7)
> core file = core -- program ``/usr/bin/ssh'' on platform SUNW,Sun-Fire-V240
> SIGSEGV: Segmentation Fault
> $C
> ffbfee20 bn_sub_words+0x3c(16b850, 16b3e0, 16b400, 7, 1, 9da20)

Looks like a problem with OpenSSL (the trace certainly points there).  Did
OpenSSL's self-test ("make tests") pass?  Does the same problem occur with
openssl-0.9.7g?

[...]
> PS: Sorry for asking, but I searched the documentation, the net and even looked
> at the configure script, but I didn't find a clue of how to enable debugging
> during compile time. Did I miss something, and if so, could you advise on how
> to enable debugging?

Debug symbols?  Depends on your compiler, but for gcc it's automatically enabled
(the "-g" flag).  If it's not, then pass the appropriate flag via --with-cflags, eg:
./configure --with-cflags=-g

Note that by default, those symbols are stripped out in the installed binaries
(ie you should use the compiled files in your build dir for debugging with gdb,
adb or similar).
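
As a minimal sketch of the above, assuming a gcc toolchain and an unpacked source tree: the function below configures with --with-cflags=-g, builds, and prints the path of the unstripped client binary in the build directory. The function name build_debug_ssh and the example paths are hypothetical, and nothing is installed.

```shell
# Sketch only: build with debug symbols in the source tree and print the
# path of the unstripped ssh binary left in the build directory.
build_debug_ssh() {
    src_dir=$1
    (
        cd "$src_dir" || exit 1
        ./configure --with-cflags=-g || exit 1
        make || exit 1
    ) >&2 || return 1   # build chatter goes to stderr, not to the caller
    echo "$src_dir/ssh"
}

# Example (paths are placeholders):
# ssh_bin=$(build_debug_ssh /path/to/openssh-4.2p1) && gdb "$ssh_bin" core
```

Pointing the debugger at the build-directory binary rather than the installed one is what keeps the symbols available in the backtrace.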
Comment 9 Jeroen Scheerder 2005-09-27 05:28:43 AEST
(1) Looks like an OpenSSL 0.9.8 issue to me.  Does not happen with 0.9.7g.
    0.9.8's "make test" was unproblematic, though.

(2) Built OpenSSH with the "-g" flag.  The core dump shows:

js@goedel:pts/5(22) adb /phil/sw/sunos/sparc/obj/openssh-4.2p1/ssh core      ~ 21:17
core file = core -- program ``ssh'' on platform SUNW,Ultra-250
SIGSEGV: Segmentation Fault
$C
bn_sub_words() + 3c
        [savfp=0xffbeef48,savpc=0x801bc]
bn_sub_part_words(13f400,13ba88,13baa8,7,1,ac0b2f1) + 10
        [savfp=0xffbeef48,savpc=0x801bc]
bn_mul_recursive(20,ffffffff,0,10,0,ffffffff) + 41c
        [savfp=0xffbeefd8,savpc=0x8006c]
bn_mul_recursive(40,ffffffff,13f360,20,0,ffffffff) + 2cc
        [savfp=0xffbef068,savpc=0x80ef8]
BN_mul(13e700,13e6b0,13e6c4,13e628,3,1) + 2b8
        [savfp=0xffbef0e0,savpc=0x836f0]
BN_mod_mul_montgomery(13e6b0,13e6b0,13e6c4,13e480,13e628,1f) + 30
        [savfp=0xffbef150,savpc=0x7ece4]
BN_mod_exp_mont_consttime(80,bb,ffbef244,b,13e628,13e480) + 424
        [savfp=0xffbef1d8,savpc=0x49068]
generate_key(13e5d0,20,12a490,0,130,218ac) + 1c8
        [savfp=0xffbef268,savpc=0x48e2c]
DH_generate_key(13e5d0,12a490,12a760,0,0,0) + c
        [savfp=0xffbef2d8,savpc=0x36338]
dh_gen_key(13e5d0,80,400,2000,ff1b5eec,ff146618) + 80
        [savfp=0xffbef348,savpc=0x34964]
kexgex_client(13b8c0,2,0,0,ff1b5eec,400) + 168
        [savfp=0xffbef3f0,savpc=0x31a98]
kex_input_kexinit(1,6,13b8c0,13b938,13ea20,2) + 45c
        [savfp=0xffbef468,savpc=0x31238]
dispatch_run(0,13b904,13b8c0,0,12a424,ff0000) + 54
        [savfp=0xffbef4e0,savpc=0x1d500]
ssh_kex2(114c00,125aa0,0,7efefeff,81010100,ff0000) + 124
        [savfp=0xffbef550,savpc=0x1b308]
ssh_login(126f34,ffbef7b8,4,4,127718,125800) + 30c
        [savfp=0xffbef7c8,savpc=0x13f34]
main(125800,125aa0,122000,126c00,127cd0,125800) + c20
        [savfp=0xffbefa78,savpc=0x1245c]
Comment 10 Darren Tucker 2005-09-27 11:00:06 AEST
(In reply to comment #9)
> (1) Looks like an OpenSSL 0.9.8 issue to me.  Does not happen with 0.9.7g.
>     0.9.8's "make test" was unproblematic, though.

What options did you use when you built openssl-0.9.8?  I'm trying to reproduce
the problem.

Comment 11 Christian Walther 2005-09-29 23:12:18 AEST
Today I built OpenSSH 4.2p1 with OpenSSL 0.9.7g. Both OpenSSL and OpenSSH
passed all tests; there wasn't a single segfault. After this I removed all build
directories and compiled OpenSSH with OpenSSL 0.9.8 again, including debugging
flags.
I don't know why, and I'm not really happy about it, but this time OpenSSH
passed all tests and seems to work flawlessly.
So, from my point of view, this bug might be closed without a solution. I guess
that the build environment probably wasn't sane. But this problem occurred on
several build results created by at least two people on two different machines
(both running Solaris 10, though).
Comment 12 Darren Tucker 2005-09-29 23:34:03 AEST
Argh, a Heisenbug!  I hate unsolved mysteries too, but I have no idea what else
to suggest.

There's a similar trace in bug #910 with a segfault at the same place (HP-UX, HP
ANSI C compiler).  I think it was openssl-0.9.8.

I'm going to leave this bug open for a while and see if we can collect any more
info.
Comment 13 Darren Tucker 2006-10-03 20:36:03 AEST
I'm now pretty sure this is an OpenSSL bug.  I helped someone else with a crash in the same place (DH GEX) and was able to reproduce it.  It was caused by a problem in the UltraSPARC assembler implementation of bn_sub_words().  Since it's in the assembler code, building OpenSSL with "no-asm" will not exhibit the problem.

This is from OpenSSL's CVS log:

[quote]
revision 1.5
date: 2005/11/15 08:02:10;  author: appro;  state: Exp;  lines: +12 -0
Apply "better safe than sorry" approach after addressing sporadic SEGV in
bn_sub_words to the rest of the sparcv8plus.S.
----------------------------
revision 1.4
date: 2005/11/11 20:07:07;  author: appro;  state: Exp;  lines: +2 -2
Attempt to resolve sporadic SEGV crashes in bn_sub_words in OpenSSH. I'm
baffled why it crashes and does it sporadically...
[/quote]

(according to OpenSSL's CVS, this patch is in OpenSSL >= 0.9.7j and >= 0.9.8b).

I replaced only that file in openssl-0.9.8a, rebuilt everything and was no longer able to reproduce the problem.  I recommend that you upgrade to OpenSSL 0.9.8d (or the latest 0.9.7) and rebuild OpenSSH (if you haven't already).

It took a while, but I think we can now close this bug :-)
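
To check an installed version against the ranges quoted above (fix present in >= 0.9.7j and >= 0.9.8b), something like the following sketch may help. The function name has_bn_sub_words_fix is made up for illustration, and it only understands plain 0.9.7x and 0.9.8x version strings; anything else is reported as "not covered" rather than guessed at.

```shell
# Hypothetical helper: classify an OpenSSL version string against the
# ranges quoted from the CVS log discussion.
# Returns 0 = fix expected, 1 = predates the fix, 2 = not covered here.
has_bn_sub_words_fix() {
    case "$1" in
        0.9.7[j-z]*|0.9.8[b-z]*) return 0 ;;
        0.9.7|0.9.7[a-i]|0.9.8|0.9.8a) return 1 ;;
        *) return 2 ;;
    esac
}

# Example: "ssh -V" prints a line such as
#   OpenSSH_4.2p1, OpenSSL 0.9.8 05 Jul 2005
# on stderr, so the OpenSSL version can be pulled out with sed:
# ver=$(ssh -V 2>&1 | sed -n 's/.*OpenSSL \([^ ]*\).*/\1/p')
# has_bn_sub_words_fix "$ver" || echo "OpenSSL $ver may lack the fix"
```

Note that this only reports which OpenSSL the ssh binary identifies itself with; OpenSSH still needs rebuilding against the upgraded library as recommended above.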
Comment 14 Darren Tucker 2007-03-01 23:28:56 AEDT
I'm pretty sure this one is now solved.  Please reopen if this is not the case.
Comment 15 Damien Miller 2008-04-04 09:55:13 AEDT
Close resolved bugs after release.