Bug 585 - sshd core dumping on IRIX 6.5.18 with VerifyReverseMapping enabled
Status: CLOSED INVALID
Alias: None
Product: Portable OpenSSH
Classification: Unclassified
Component: sshd
Version: -current
Hardware: MIPS IRIX
Importance: P2 major
Assignee: OpenSSH Bugzilla mailing list
Duplicates: 574
Reported: 2003-06-03 22:47 AEST by Kevin Taylor
Modified: 2004-04-14 12:24 AEST

Description Kevin Taylor 2003-06-03 22:47:53 AEST
** I'm re-opening this case (it was bug #574). I don't think it got entered
correctly into the system **


Occasionally, we're noticing that sshd is core dumping on our IRIX 6.5.18f machine.

The only time we've noticed it is when users are logging in with putty
from offsite (although this is not really a client issue).

The user manages to log in and sshd apparently core dumps, but the user is not
logged out: the privilege-separated sshd child keeps running as that user, and
its parent is 1, so the root-owned sshd process is gone.

wtmp is not updated, so the only way you can tell the user is logged in is by
listing their processes.

The end user doesn't notice that anything happened. This doesn't ALWAYS
happen, and I can't correlate it with any system event. It happens when the
system is first started, and it happens when it's busier.



First core:

   6 record_login(pid = 13759, ttyname = 0x1014a22c = "/dev/ttyq7", user =
0x101520d8 = "user1", uid = ####, host = 0x101522a8 =
"pcp01711145pcs.nrockv01.md.`omcast.net", addr = 0x7fff24b0, addrlen = 16)
["/usr/local/src/security/openssh-3.6.1p1/sshlogin.c":72, 0x1002be58]


Second core:

   6 record_login(pid = 182438, ttyname = 0x1014a22c = "/dev/ttyq39", user =
0x101520d8 = "user2", uid = ####, host = 0x10152358 =
"toronto-hse-ppp3760148.symp`tico.ca", addr = 0x7fff24b0, addrlen = 16)
["/usr/local/src/security/openssh-3.6.1p1/sshlogin.c":72, 0x1002be58]


For some reason, the 28th character of each hostname is corrupted. The first
hostname should end in comcast.net, the second in sympatico.ca.

After looking through the source code, the actual problem may lie in 
verify_reverse_mapping.

We had this option enabled in sshd_config, we disabled it and are currently
monitoring for the core dumps. If we don't see any, that may be the root of this
problem....hopefully it will point someone in the right direction towards fixing it.

After about 2 weeks, we have not had any core files, so it was definitely this
option causing the crashing problem.
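The position and nature of the damaged character can be checked mechanically. The following is a minimal Python sketch, not part of the original report: the helper name `diff_bytes` is hypothetical, and the "expected" hostnames are reconstructed from the two domains named above (comcast.net and sympatico.ca).

```python
def diff_bytes(seen: str, expected: str):
    """Return (offset, seen_byte, expected_byte, xor) for each differing position."""
    assert len(seen) == len(expected)
    return [
        (i, ord(s), ord(e), ord(s) ^ ord(e))
        for i, (s, e) in enumerate(zip(seen, expected))
        if s != e
    ]

# Hostnames as printed by dbx, paired with the reconstructed originals.
pairs = [
    ("pcp01711145pcs.nrockv01.md.`omcast.net",
     "pcp01711145pcs.nrockv01.md.comcast.net"),
    ("toronto-hse-ppp3760148.symp`tico.ca",
     "toronto-hse-ppp3760148.sympatico.ca"),
]

for seen, expected in pairs:
    for offset, s, e, x in diff_bytes(seen, expected):
        print(f"offset {offset}: saw 0x{s:02x}, expected 0x{e:02x}, xor 0x{x:02x}")
```

In both cores the only difference is at zero-based offset 27, and the byte differs from the expected one only in its two lowest bits (0x63 and 0x61 both became 0x60). That pattern looks less like random garbage and more like something clearing low-order bits of a word that overlaps the string, though this is only an inference from two samples.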
Comment 1 Damien Miller 2003-06-03 23:02:21 AEST
It looks like the hostnames are being scribbled over by something. Perhaps a bug
in getaddrinfo()?

Is Irix using our getaddrinfo() replacement? (check for HAVE_GETADDRINFO in
config.h)

I doubt that the bug is in our canohost.c file, as it is used on all platforms.

Also, did you compile in 64-bit mode?
Comment 2 Kevin Taylor 2003-06-03 23:05:44 AEST
/* Define to 1 if you have the `getaddrinfo' function. */
#define HAVE_GETADDRINFO 1

We compile in n32 mode.

Comment 3 Kevin Taylor 2003-06-03 23:09:45 AEST
This was also in our config.h

/* getaddrinfo is broken (if present) */
/* #undef BROKEN_GETADDRINFO */


I'm not sure if it matters much that we're using openssh-3.6.1p1, not p2.
Comment 4 Damien Miller 2003-06-04 08:41:07 AEST
Well, that indicates that you are using the system getaddrinfo function. We have
encountered bugs on some platforms' versions of these, but never ones leading to
crashes.
Comment 5 Damien Miller 2003-06-04 18:59:43 AEST
*** Bug 574 has been marked as a duplicate of this bug. ***
Comment 6 Damien Miller 2003-06-04 19:03:33 AEST
I just discovered your debugger output in bug #574 - this looks like things are
blowing up inside malloc(). This is usually an indication that memory has been
trashed before the call. 

Consider building against ElectricFence[1] or some other malloc debugging
library. This would likely show up the error at the time the corruption happens.

[1] ftp://ftp.perens.com/pub/ElectricFence/ (I have no idea whether or not it
works on Irix)
Comment 7 Kevin Taylor 2003-06-04 20:19:36 AEST
not having any luck getting to sites with dmalloc tools. Unfortunately I'm not
very experienced with source debugging, so hopefully these things are easy to
implement. 
Comment 8 Damien Miller 2003-06-04 20:24:50 AEST
I should also warn you that ElectricFence drives up memory usage considerably.
Comment 9 Kevin Taylor 2003-06-04 20:29:08 AEST
that could be a problem then, the system we're seeing the problems on may run
into troubles with high memory usage from sshd. 

I may try forcing sshd to build using your getaddrinfo, and maybe that will
clear things up temporarily, although may not solve the actual problem.
Unfortunately we don't have a good test scenario that can generate this problem.
It has to happen on our main production box.
Comment 10 Kevin Taylor 2003-06-11 01:16:10 AEST
Due to the security bug, we re-enabled VerifyReverseMapping and immediately saw
core dumps again, which confirms we're looking in the right spot.

Luckily the users are not inconvenienced by this.

Tomorrow, we're going to try using the sshd binary that uses the non-system
getaddrinfo function. (we rebuilt after unsetting  HAVE_GETADDRINFO in config.h)
Hopefully that's all we needed to do.
Comment 11 Kevin Taylor 2003-06-12 02:12:34 AEST
Ok. After using the fake-getaddrinfo, sshd is still crashing. Here's the latest
dbx output.

Is there anything else we can look at without resorting to memory debugging?


>  0 realfree(0x10165f80, 0x10151490, 0x10165f60, 0x73706561, 0x73706560,
0x7ffed420, 0x10, 0x0)
["/xlv86/patches/5015/work/irix/lib/libc/libc_n32_M4/gen/malloc.c":527, 0xfb2466c]
   1 cleanfree(0x0, 0x10151490, 0x10165f60, 0x73706561, 0x73706560, 0x7ffed420,
0x10, 0x0)
["/xlv86/patches/5015/work/irix/lib/libc/libc_n32_M4/gen/malloc.c":944, 0xfb24eac]
   2 __malloc(0x260, 0x10151490, 0x10165f60, 0x73706561, 0x73706560, 0x7ffed420,
0x10, 0x0)
["/xlv86/patches/5015/work/irix/lib/libc/libc_n32_M4/gen/malloc.c":230, 0xfb240e0]
   3 _malloc(0x0, 0x10151490, 0x10165f60, 0x73706561, 0x73706560, 0x7ffed420,
0x10, 0x0)
["/xlv86/patches/5015/work/irix/lib/libc/libc_n32_M4/gen/malloc.c":186, 0xfb23f4c]
   4 xmalloc(size = 608)
["/usr/local/src/security/openssh-3.6.1p1/xmalloc.c":28, 0x10065994]
   5 login_alloc_entry(pid = 20692179, username = 0x10151490 = "asdfa", hostname
= 0x10165f60 = "dsl093-055-063.blt1.dsl.spe`keasy.net", line = 0x1014a27c =
"/dev/ttyq25") ["/usr/local/src/security/openssh-3.6.1p1/loginrec.c":325,
0x10048b00]
   6 record_login(pid = 20692179, ttyname = 0x1014a27c = "/dev/ttyq25", user =
0x10151490 = "asdf", uid = ####, host = 0x10165f60 =
"dsl093-055-063.blt1.dsl.spe`keasy.net", addr = 0x7ffed420, addrlen = 16)
["/usr/local/src/security/openssh-3.6.1p1/sshlogin.c":72, 0x1002beb8]
   7 mm_record_login(s = 0x1014a248, pw = 0x1015dc08)
["/usr/local/src/security/openssh-3.6.1p1/monitor.c":1030, 0x10042c84]
   8 mm_answer_pty(socket = 6, m = 0x7ffed510)
["/usr/local/src/security/openssh-3.6.1p1/monitor.c":1080, 0x10042f2c]
   9 monitor_read(pmonitor = 0x101527c0, ent = 0x10137790, pent = (nil))
["/usr/local/src/security/openssh-3.6.1p1/monitor.c":371, 0x10040f54]
   10 monitor_child_postauth(pmonitor = 0x101527c0)
["/usr/local/src/security/openssh-3.6.1p1/monitor.c":334, 0x10040dac]
   11 privsep_postauth(authctxt = 0x101515b0)
["/usr/local/src/security/openssh-3.6.1p1/sshd.c":665, 0x10025f78]
   12 main(ac = 1, av = 0x7ffedf14)
["/usr/local/src/security/openssh-3.6.1p1/sshd.c":1533, 0x10028a88]
   13 __start()
["/xlv55/kudzu-apr12/work/irix/lib/libc/libc_n32_M4/csu/crt1text.s":177, 0x10024a48]
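One thing that can be read off this trace without a memory debugger: two of the values dbx prints for the libc malloc frames, 0x73706561 and 0x73706560, are printable ASCII. The sketch below is illustrative only (the helper `word_to_ascii` is a hypothetical name); it decodes them as big-endian words, which matches the byte order of MIPS under IRIX, and compares them with the hostname in frames 5 and 6.

```python
def word_to_ascii(word: int) -> str:
    """Interpret a 32-bit value as four big-endian ASCII bytes."""
    return word.to_bytes(4, "big").decode("ascii")

# Hostname exactly as dbx printed it in frames 5 and 6.
hostname = "dsl093-055-063.blt1.dsl.spe`keasy.net"

for word in (0x73706561, 0x73706560):
    print(hex(word), repr(word_to_ascii(word)))

# The four bytes at zero-based offsets 24..27 of the hostname.
print(repr(hostname[24:28]))
```

The first word decodes to "spea" and the second to "spe`", and the second is exactly the four bytes at offsets 24 to 27 of the corrupted hostname. This suggests, but does not prove, that malloc is treating part of the hostname string as one of its own bookkeeping words and modifying its low bits.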
Comment 12 Darren Tucker 2003-06-12 20:41:18 AEST
Out of curiosity, what is MAXHOSTNAMELEN defined as on IRIX?
Comment 13 Kevin Taylor 2003-06-12 20:50:02 AEST
param.h:#define MAXHOSTNAMELEN 256    /* can't be longer than SYS_NMLN - 1 */
Comment 14 Kevin Taylor 2003-06-12 20:55:41 AEST
FYI

utsname.h:#define _SYS_NMLN     257     /* 4.0 size of utsname elements.*/
Comment 15 Damien Miller 2003-06-13 09:20:51 AEST
If you aren't already, you may want to try a CVS snapshot to see if the problem
has already been fixed there. Otherwise you will have to try malloc debugging -
the crash is definitely occurring inside malloc.
Comment 16 Darren Tucker 2003-07-07 00:32:40 AEST
dmalloc (http://dmalloc.com/) claims to work on IRIX.  It's likely to increase
the CPU and memory load, though.

I've built with dmalloc on Linux thusly:
LDFLAGS=-ldmalloc ./configure && make
eval `dmalloc -l /path/to/log high`
./sshd [options]
Comment 17 Kevin Taylor 2003-07-14 21:41:36 AEST
I believe I built openssh with -ldmalloc 

I ran the command you suggested, but there's nothing being logged to that log file.

I've never used dmalloc before, so I'm not sure what I'm doing. 

Do you have anything special that needs to be set up in your .dmallocrc?
Comment 18 Darren Tucker 2003-07-14 22:35:03 AEST
No, I don't have a .dmallocrc.  Here is exactly what I did:

# eval `dmalloc -l /tmp/dmalloc.log high`
# ./sshd -d -p 2022
[debugging output snipped]
# ls -l /tmp/dmalloc.log
-rw-r--r--    1 dtucker  dtucker     39314 Jul 14 22:13 /tmp/dmalloc.log

There might be a couple of logging problems: there will be one log per
connection plus one for the master daemon (the dmalloc docs say you can use
"%d" for a pid but that didn't work for me), and the logging will be done
partly as the user, not root.

Even without the logging, dmalloc will be useful as it should abort the ssh
session (with a core dump) as soon as it detects a problem, rather than some
time later.
Comment 19 Kevin Taylor 2003-07-15 00:13:20 AEST
well, I got a core dump and ran dbx against it and it doesn't look like anything
different to me. The corefile size was significantly larger (28MB vs 3MB for the
other)...so I assume dmalloc was working.

>  0 realfree(0x10152388, 0x101520e8, 0x0, 0x6e686463, 0x6e686460, 0x1,
0x10166968, 0x0)
["/xlv86/patches/5015/work/irix/lib/libc/libc_n32_M4/gen/malloc.c":538, 0xfb24694]
   1 cleanfree(0x0, 0x101520e8, 0x0, 0x6e686463, 0x6e686460, 0x1, 0x10166968,
0x0) ["/xlv86/patches/5015/work/irix/lib/libc/libc_n32_M4/gen/malloc.c":944,
0xfb24eac]
   2 __malloc(0x260, 0x101520e8, 0x0, 0x6e686463, 0x6e686460, 0x1, 0x10166968,
0x0) ["/xlv86/patches/5015/work/irix/lib/libc/libc_n32_M4/gen/malloc.c":230,
0xfb240e0]
   3 _malloc(0x0, 0x101520e8, 0x0, 0x6e686463, 0x6e686460, 0x1, 0x10166968, 0x0)
["/xlv86/patches/5015/work/irix/lib/libc/libc_n32_M4/gen/malloc.c":186, 0xfb23f4c]
   4 xmalloc(size = 608)
["/usr/local/src/security/openssh-3.6.1p1/xmalloc.c":28, 0x10065934]
   5 login_alloc_entry(pid = 77144314, username = 0x101520e8 = "ktaylor",
hostname = 0x10152368 = "66-44-105-44.s806.apx2.lnhd`.md.dialup.rcn.com", line =
0x1014a20c = "/dev/ttyq51")
["/usr/local/src/security/openssh-3.6.1p1/loginrec.c":325, 0x10048aa0]
   6 record_login(pid = 77144314, ttyname = 0x1014a20c = "/dev/ttyq51", user =
0x101520e8 = "ktaylor", uid = ####, host = 0x10152368 =
"66-44-105-44.s806.apx2.lnhd`.md.dialup.rcn.com", addr = 0x7ffed410, addrlen =
16) ["/usr/local/src/security/openssh-3.6.1p1/sshlogin.c":72, 0x1002be58]
   7 mm_record_login(s = 0x1014a1d8, pw = 0x1015dbc8)
["/usr/local/src/security/openssh-3.6.1p1/monitor.c":1030, 0x10042c24]
   8 mm_answer_pty(socket = 7, m = 0x7ffed500)
["/usr/local/src/security/openssh-3.6.1p1/monitor.c":1080, 0x10042ecc]
   9 monitor_read(pmonitor = 0x10152780, ent = 0x10137730, pent = (nil))
["/usr/local/src/security/openssh-3.6.1p1/monitor.c":371, 0x10040ef4]
   10 monitor_child_postauth(pmonitor = 0x10152780)
["/usr/local/src/security/openssh-3.6.1p1/monitor.c":334, 0x10040d4c]
   11 privsep_postauth(authctxt = 0x10151570)
["/usr/local/src/security/openssh-3.6.1p1/sshd.c":665, 0x10025f18]
   12 main(ac = 3, av = 0x7ffedf04)
["/usr/local/src/security/openssh-3.6.1p1/sshd.c":1533, 0x10028a28]
   13 __start()
["/xlv55/kudzu-apr12/work/irix/lib/libc/libc_n32_M4/csu/crt1text.s":177, 0x100249e8]
Comment 20 Darren Tucker 2003-07-15 00:49:51 AEST
Some googling shows that openssh is not the only thing with these symptoms:
http://opendx.npaci.edu/mail/opendx-dev/2000.01/msg00068.html
http://mail.python.org/pipermail/python-bugs-list/2003-April/017283.html

The second is the same release of IRIX and hints at a cause in netdb.h relating 
to getaddrinfo/getnameinfo.

Try commenting out "#define HAVE_GETADDRINFO 1" in config.h and recompiling.

What signal is it dying with, SIGSEGV or SIGBUS?  Is your resolver configured 
for pure DNS/hosts or NIS?  Can SGI be of any help?  10 bucks says the IRIX 
getaddrinfo is trashing memory.
Comment 21 Kevin Taylor 2003-07-15 00:56:49 AEST
I thought we tried rebuilding with the openssh getaddrinfo...but our config.h
says otherwise. I'll give it another go.

It is dying with SIGSEGV

We're using hosts/dns, no nis here.

I can try SGI, but just like everyone else, as soon as you mention some third
party software they don't want to hear it.
Comment 22 Kevin Taylor 2003-07-15 01:23:58 AEST
ok. I commented out the getaddrinfo line from config.h, reran the make (binary is
a little larger) and still got a core dump, which looks about the same:

>  0 realfree(0x10166d70, 0x101523b0, 0x0, 0x6e686463, 0x6e686460, 0x1,
0x10166d80, 0x0)
["/xlv86/patches/5015/work/irix/lib/libc/libc_n32_M4/gen/malloc.c":538, 0xfb24694]
   1 cleanfree(0x0, 0x101523b0, 0x0, 0x6e686463, 0x6e686460, 0x1, 0x10166d80,
0x0) ["/xlv86/patches/5015/work/irix/lib/libc/libc_n32_M4/gen/malloc.c":944,
0xfb24eac]
   2 __malloc(0x260, 0x101523b0, 0x0, 0x6e686463, 0x6e686460, 0x1, 0x10166d80,
0x0) ["/xlv86/patches/5015/work/irix/lib/libc/libc_n32_M4/gen/malloc.c":230,
0xfb240e0]
   3 _malloc(0x0, 0x101523b0, 0x0, 0x6e686463, 0x6e686460, 0x1, 0x10166d80, 0x0)
["/xlv86/patches/5015/work/irix/lib/libc/libc_n32_M4/gen/malloc.c":186, 0xfb23f4c]
   4 xmalloc(size = 608)
["/usr/local/src/security/openssh-3.6.1p1/xmalloc.c":28, 0x10065994]
   5 login_alloc_entry(pid = 78265682, username = 0x101523b0 = "ktaylor",
hostname = 0x10166d50 = "66-44-42-202.s710.apx1.lnhd`.md.dialup.rcn.com", line =
0x1014a25c = "/dev/ttyq70")
["/usr/local/src/security/openssh-3.6.1p1/loginrec.c":325, 0x10048b00]
   6 record_login(pid = 78265682, ttyname = 0x1014a25c = "/dev/ttyq70", user =
0x101523b0 = "ktaylor", uid = ####, host = 0x10166d50 =
"66-44-42-202.s710.apx1.lnhd`.md.dialup.rcn.com", addr = 0x7ffed410, addrlen =
16) ["/usr/local/src/security/openssh-3.6.1p1/sshlogin.c":72, 0x1002beb8]
   7 mm_record_login(s = 0x1014a228, pw = 0x1015d960)
["/usr/local/src/security/openssh-3.6.1p1/monitor.c":1030, 0x10042c84]
   8 mm_answer_pty(socket = 7, m = 0x7ffed500)
["/usr/local/src/security/openssh-3.6.1p1/monitor.c":1080, 0x10042f2c]
   9 monitor_read(pmonitor = 0x101527a0, ent = 0x10137770, pent = (nil))
["/usr/local/src/security/openssh-3.6.1p1/monitor.c":371, 0x10040f54]
   10 monitor_child_postauth(pmonitor = 0x101527a0)
["/usr/local/src/security/openssh-3.6.1p1/monitor.c":334, 0x10040dac]
   11 privsep_postauth(authctxt = 0x10151590)
["/usr/local/src/security/openssh-3.6.1p1/sshd.c":665, 0x10025f78]
   12 main(ac = 3, av = 0x7ffedf04)
["/usr/local/src/security/openssh-3.6.1p1/sshd.c":1533, 0x10028a88]
   13 __start()
["/xlv55/kudzu-apr12/work/irix/lib/libc/libc_n32_M4/csu/crt1text.s":177, 0x10024a48]
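The three backtraces posted so far (comments 11, 19 and 22) can also be compared by address. The sketch below is a rough consistency check, not a diagnosis: it assumes the first value dbx prints for realfree() is the address of the block malloc is working on, which the traces do not confirm, and the function name `analyse` is hypothetical.

```python
# (label, hostname as printed by dbx, hostname address, first realfree() value)
traces = [
    ("comment 11", "dsl093-055-063.blt1.dsl.spe`keasy.net",
     0x10165F60, 0x10165F80),
    ("comment 19", "66-44-105-44.s806.apx2.lnhd`.md.dialup.rcn.com",
     0x10152368, 0x10152388),
    ("comment 22", "66-44-42-202.s710.apx1.lnhd`.md.dialup.rcn.com",
     0x10166D50, 0x10166D70),
]

def analyse(hostname: str, host_addr: int, realfree_arg: int):
    """Return (offset of corrupted byte, realfree_arg - host_addr,
    realfree_arg - address of the aligned word holding the corrupted byte)."""
    offset = hostname.index("`")
    word_addr = (host_addr + offset) & ~0x3  # round down to a 4-byte boundary
    return offset, realfree_arg - host_addr, realfree_arg - word_addr

for label, hostname, host_addr, realfree_arg in traces:
    offset, gap, back = analyse(hostname, host_addr, realfree_arg)
    print(f"{label}: offset {offset}, gap {gap} bytes, word is {back} bytes before")
```

In all three traces the corrupted byte is at offset 27, the realfree() value is 32 bytes past the start of the hostname, and the damaged word sits 8 bytes before that value. If the assumption above holds, this would be consistent with the hostname string extending past the end of its own allocation into bookkeeping data for the next block, whichever getaddrinfo implementation is in use.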
Comment 23 Kevin Taylor 2003-07-15 01:29:07 AEST
If I can find the time, I've got something I may try. 

http://freeware.sgi.com has openssh-3.5p1 available. I don't know if the problem
existed at that time, if it did, I can download the sgi release of it and see
what the source code differences are and see if that points to something.

Of course, if 3.5p1 direct from openssh works fine, then we're still stuck.
Comment 24 Kevin Taylor 2003-07-15 22:33:45 AEST
I've been trying several different things today and set this line in config.h

#define BROKEN_GETADDRINFO

(in addition to the other GETADDRINFO line in config.h)

and I get this error...

"../openbsd-compat/fake-getaddrinfo.h", line 40: error(1143): declaration is
          incompatible with "const char *gai_strerror(int)" (declared at line
          147 of "/usr/include/netdb.h")
  char *gai_strerror(int ecode);
        ^
Comment 25 Kevin Taylor 2003-07-15 23:20:32 AEST
ok, I don't really want to jinx anything by posting this, but I think it's
starting to work now.

After I set the #define BROKEN_GETADDRINFO line, I got that error message listed
in the last message...what I did to get it to compile was to comment out the
offending line in /usr/include/netdb.h...just to get it to compile, and that
made it happy.

At this point it does look like a problem with SGI's getaddrinfo, BUT getting
the fake-getaddrinfo to build into ssh requires at least the BROKEN_GETADDRINFO
define to be set. If someone can offer a clean way to get the openbsd-compat
stuff to build (rather than editing the system headers), that would be a good
solution.

I hope that this new binary is really working and my hopes haven't been
prematurely raised. :)
Comment 26 Kevin Taylor 2003-07-17 00:04:28 AEST
I found that making these changes gets the stuff built without modifying any
system includes:

openbsd-compat/fake-getaddrinfo.h

40c40
< char *gai_strerror(int ecode);
---
> const char *gai_strerror(int ecode);


openbsd-compat/fake-getaddrinfo.c

18c18
< char *gai_strerror(int ecode)
---
> const char *gai_strerror(int ecode)
Comment 27 Darren Tucker 2003-09-05 13:39:30 AEST
What's the status of this?

At the moment, my understanding is:
* a bug exists in getaddrinfo in IRIX 6.5.18 and up
* defining BROKEN_GETADDRINFO causes a type clash with gai_strerror
* solving the type clash results in an sshd that works OK

Should we be defining BROKEN_GETADDRINFO for some IRIXes?  If so, which
versions, and is there a clean way to solve the type clash?

Comment 28 Kevin Taylor 2003-09-05 22:43:43 AEST
Unfortunately at this time I can't confirm that the problem has gone away in the
recent version of IRIX (6.5.21), maybe in a few months we'll have an updated
machine we can try it on.
Comment 29 Darren Tucker 2003-12-22 21:22:35 AEDT
The "const char *gai_strerror" thing is now handled:

20030923
[snip]
 - (dtucker) [configure.ac openbsd-compat/fake-rfc2553.c
   openbsd-compat/fake-rfc2553.h] Bug #659: Test for and handle systems with
   where gai_strerror is defined as "const char *".  Part of patch supplied
   by bugzilla-openssh at thewrittenword.com

Since it appears that the root cause is a bug in an OS library, I'm closing this
bug.  Please re-open if you can identify a fault in OpenSSH.
Comment 30 Damien Miller 2004-04-14 12:24:19 AEST
Mass change of RESOLVED bugs to CLOSED