** I'm re-opening this case (it was bug #574). I don't think it got entered correctly into the system **
Occasionally, we're noticing that sshd is core dumping on our IRIX 6.5.18f machine. The only time we've noticed it is when users are logging in with PuTTY from offsite (although this is not really a client issue). The user manages to log in and sshd apparently core dumps, but the user is not logged out: the privilege-separated user is still running their own sshd child, its parent is now 1, and the root-owned sshd process is gone. wtmp is not updated, so the only way you can tell the user is logged in is by listing their processes. The end user doesn't notice that anything happened. This doesn't ALWAYS happen, and I can't correlate it with any system event: it will happen when the system is first started, and it will happen when it's busier.
First core:
6 record_login(pid = 13759, ttyname = 0x1014a22c = "/dev/ttyq7", user = 0x101520d8 = "user1", uid = ####, host = 0x101522a8 = "pcp01711145pcs.nrockv01.md.`omcast.net", addr = 0x7fff24b0, addrlen = 16) ["/usr/local/src/security/openssh-3.6.1p1/sshlogin.c":72, 0x1002be58]
Second core:
6 record_login(pid = 182438, ttyname = 0x1014a22c = "/dev/ttyq39", user = 0x101520d8 = "user2", uid = ####, host = 0x10152358 = "toronto-hse-ppp3760148.symp`tico.ca", addr = 0x7fff24b0, addrlen = 16) ["/usr/local/src/security/openssh-3.6.1p1/sshlogin.c":72, 0x1002be58]
For some reason, the 28th character of the hostname is messed up. The first hostname should end in .comcast.net, the second in .sympatico.ca
After looking through the source code, the actual problem may lie in verify_reverse_mapping. We had this option enabled in sshd_config; we have disabled it and are currently monitoring for core dumps. If we don't see any, that may be the root of this problem. Hopefully it will point someone in the right direction towards fixing it.
After about 2 weeks, we have not had any core files, so it was definitely this option causing the crashing problem.
It looks like the hostnames are being scribbled over by something. Perhaps a bug in getaddrinfo()? Is Irix using our getaddrinfo() replacement? (check for HAVE_GETADDRINFO in config.h) I doubt that the bug is in our canohost.c file, as it is used on all platforms. Also, did you compile in 64-bit mode?
/* Define to 1 if you have the `getaddrinfo' function. */
#define HAVE_GETADDRINFO 1
We compile in n32 mode.
This was also in our config.h:
/* getaddrinfo is broken (if present) */
/* #undef BROKEN_GETADDRINFO */
I'm not sure if it matters much that we're using openssh-3.6.1p1, not p2.
Well, that indicates that you are using the system getaddrinfo function. We have encountered bugs on some platforms' versions of these, but never ones leading to crashes.
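For readers unfamiliar with what a reverse-mapping check involves, here is a minimal sketch of the general technique: look up the name for the peer address, then look the name up again and confirm it maps back to the same address. This is not the OpenSSH implementation; the function names name_maps_back_to and verify_reverse are hypothetical, and the sketch handles IPv4 only. It does show why both getnameinfo() and getaddrinfo() are exercised on every login when the option is enabled.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>
#include <string.h>

/*
 * Hypothetical helper, not OpenSSH code: returns 1 if a forward lookup
 * of `name` yields the IPv4 address in `peer`, 0 if it does not, and
 * -1 if the resolver reports an error.
 */
int
name_maps_back_to(const char *name, const struct sockaddr_in *peer)
{
	struct addrinfo hints, *res, *ai;
	int match = 0;

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_INET;
	hints.ai_socktype = SOCK_STREAM;
	if (getaddrinfo(name, NULL, &hints, &res) != 0)
		return -1;
	for (ai = res; ai != NULL; ai = ai->ai_next) {
		const struct sockaddr_in *sin =
		    (const struct sockaddr_in *)ai->ai_addr;

		if (sin->sin_addr.s_addr == peer->sin_addr.s_addr) {
			match = 1;
			break;
		}
	}
	freeaddrinfo(res);
	return match;
}

/*
 * Hypothetical wrapper: reverse-resolve `peer` into `name` (a buffer of
 * `namelen` bytes), then confirm the name with a forward lookup.
 * Returns 1 if confirmed, 0 if the name does not map back, -1 on error.
 */
int
verify_reverse(const struct sockaddr_in *peer, char *name, size_t namelen)
{
	if (getnameinfo((const struct sockaddr *)peer,
	    (socklen_t)sizeof(*peer), name, (socklen_t)namelen,
	    NULL, 0, NI_NAMEREQD) != 0)
		return -1;
	return name_maps_back_to(name, peer);
}
```

The name buffer is owned by the caller, so a bug in either resolver call that writes outside it would show up later as heap or stack damage rather than at the call itself.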
*** Bug 574 has been marked as a duplicate of this bug. ***
I just discovered your debugger output in bug #574 - this looks like things are blowing up inside malloc(). This is usually an indication that memory has been trashed before the call. Consider building against ElectricFence[1] or some other malloc debugging library. This would likely show up the error at the time the corruption happens. [1] ftp://ftp.perens.com/pub/ElectricFence/ (I have no idea whether or not it works on Irix)
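To illustrate why a malloc debugging library reports corruption closer to where it happens, here is a minimal sketch of one common idea: put known guard bytes after each allocation and check them later. This is only an illustration under stated assumptions; ElectricFence itself relies on inaccessible memory pages rather than guard bytes, and guarded_alloc and guard_intact are hypothetical names, not part of any library mentioned here.

```c
#include <stdlib.h>
#include <string.h>

#define GUARD_LEN  8     /* number of guard bytes placed after each block */
#define GUARD_BYTE 0xA5  /* arbitrary pattern unlikely to occur by chance */

/* Allocate `size` usable bytes followed by GUARD_LEN guard bytes. */
void *
guarded_alloc(size_t size)
{
	unsigned char *p = malloc(size + GUARD_LEN);

	if (p == NULL)
		return NULL;
	memset(p + size, GUARD_BYTE, GUARD_LEN);
	return p;
}

/* Return 1 if the guard bytes after a `size`-byte block are intact. */
int
guard_intact(const void *block, size_t size)
{
	const unsigned char *g = (const unsigned char *)block + size;
	size_t i;

	for (i = 0; i < GUARD_LEN; i++)
		if (g[i] != GUARD_BYTE)
			return 0;
	return 1;
}
```

Checking guard_intact() right after a suspect call narrows down which call wrote past the end of the buffer, instead of waiting for a later malloc() to trip over the damage.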
not having any luck getting to sites with dmalloc tools. Unfortunately I'm not very experienced with source debugging, so hopefully these things are easy to implement.
I should also warn you that electricfence drives up memory usage considerably
that could be a problem then, the system we're seeing the problems on may run into troubles with high memory usage from sshd. I may try forcing sshd to build using your getaddrinfo, and maybe that will clear things up temporarily, although may not solve the actual problem. Unfortunately we don't have a good test scenario that can generate this problem. It has to happen on our main production box.
Due to the security bug, we re-enabled VerifyReverseMapping and immediately saw core dumps again, so that just proves we're looking in the right spot. Luckily the users are not inconvenienced by this. Tomorrow, we're going to try using the sshd binary that uses the non-system getaddrinfo function (we rebuilt after unsetting HAVE_GETADDRINFO in config.h). Hopefully that's all we needed to do.
Ok. After using the fake-getaddrinfo, sshd is still crashing. Here's the latest dbx output. Is there anything else we can look at without resorting to memory debugging?
> 0 realfree(0x10165f80, 0x10151490, 0x10165f60, 0x73706561, 0x73706560, 0x7ffed420, 0x10, 0x0) ["/xlv86/patches/5015/work/irix/lib/libc/libc_n32_M4/gen/malloc.c":527, 0xfb2466c]
1 cleanfree(0x0, 0x10151490, 0x10165f60, 0x73706561, 0x73706560, 0x7ffed420, 0x10, 0x0) ["/xlv86/patches/5015/work/irix/lib/libc/libc_n32_M4/gen/malloc.c":944, 0xfb24eac]
2 __malloc(0x260, 0x10151490, 0x10165f60, 0x73706561, 0x73706560, 0x7ffed420, 0x10, 0x0) ["/xlv86/patches/5015/work/irix/lib/libc/libc_n32_M4/gen/malloc.c":230, 0xfb240e0]
3 _malloc(0x0, 0x10151490, 0x10165f60, 0x73706561, 0x73706560, 0x7ffed420, 0x10, 0x0) ["/xlv86/patches/5015/work/irix/lib/libc/libc_n32_M4/gen/malloc.c":186, 0xfb23f4c]
4 xmalloc(size = 608) ["/usr/local/src/security/openssh-3.6.1p1/xmalloc.c":28, 0x10065994]
5 login_alloc_entry(pid = 20692179, username = 0x10151490 = "asdfa", hostname = 0x10165f60 = "dsl093-055-063.blt1.dsl.spe`keasy.net", line = 0x1014a27c = "/dev/ttyq25") ["/usr/local/src/security/openssh-3.6.1p1/loginrec.c":325, 0x10048b00]
6 record_login(pid = 20692179, ttyname = 0x1014a27c = "/dev/ttyq25", user = 0x10151490 = "asdf", uid = ####, host = 0x10165f60 = "dsl093-055-063.blt1.dsl.spe`keasy.net", addr = 0x7ffed420, addrlen = 16) ["/usr/local/src/security/openssh-3.6.1p1/sshlogin.c":72, 0x1002beb8]
7 mm_record_login(s = 0x1014a248, pw = 0x1015dc08) ["/usr/local/src/security/openssh-3.6.1p1/monitor.c":1030, 0x10042c84]
8 mm_answer_pty(socket = 6, m = 0x7ffed510) ["/usr/local/src/security/openssh-3.6.1p1/monitor.c":1080, 0x10042f2c]
9 monitor_read(pmonitor = 0x101527c0, ent = 0x10137790, pent = (nil)) ["/usr/local/src/security/openssh-3.6.1p1/monitor.c":371, 0x10040f54]
10 monitor_child_postauth(pmonitor = 0x101527c0) ["/usr/local/src/security/openssh-3.6.1p1/monitor.c":334, 0x10040dac]
11 privsep_postauth(authctxt = 0x101515b0) ["/usr/local/src/security/openssh-3.6.1p1/sshd.c":665, 0x10025f78]
12 main(ac = 1, av = 0x7ffedf14) ["/usr/local/src/security/openssh-3.6.1p1/sshd.c":1533, 0x10028a88]
13 __start() ["/xlv55/kudzu-apr12/work/irix/lib/libc/libc_n32_M4/csu/crt1text.s":177, 0x10024a48]
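One detail in the malloc frames of the trace above is worth decoding: the fourth and fifth arguments, 0x73706561 and 0x73706560, are printable ASCII when read most significant byte first. The sketch below converts such words to text and shows that the second value equals the first with its low two bits cleared. The helper names are hypothetical. That the allocator is treating part of the hostname string as one of its own bookkeeping words is an assumption consistent with these values, not something the trace proves.

```c
#include <stdint.h>

/*
 * Write the four bytes of `word`, most significant first, into `out`
 * as a NUL-terminated string. Non-printable bytes become '.'.
 */
void
word_to_ascii(uint32_t word, char out[5])
{
	int i;

	for (i = 0; i < 4; i++) {
		unsigned char c = (unsigned char)((word >> (24 - 8 * i)) & 0xff);

		out[i] = (c >= 0x20 && c < 0x7f) ? (char)c : '.';
	}
	out[4] = '\0';
}

/* Clear the low two bits of a word, as an allocator might for flag bits. */
uint32_t
clear_low_two_bits(uint32_t word)
{
	return word & ~(uint32_t)0x3;
}
```

Applied to the values above, 0x73706561 decodes to "spea" and 0x73706560 to "spe`", which matches the damaged "spe`keasy.net" in frames 5 and 6.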
Out of curiosity, what is MAXHOSTNAMELEN defined as on IRIX?
param.h:#define MAXHOSTNAMELEN 256 /* can't be longer than SYS_NMLN - 1 */
FYI utsname.h:#define _SYS_NMLN 257 /* 4.0 size of utsname elements.*/
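The question about MAXHOSTNAMELEN matters because hostnames are often copied into fixed-size buffers. As a minimal sketch of a bounded copy that reports truncation instead of overrunning the destination, assuming a 256-byte limit as reported above (EXAMPLE_HOSTLEN and copy_hostname are hypothetical names, not OpenSSH code):

```c
#include <stdio.h>
#include <string.h>

/* Value reported for MAXHOSTNAMELEN in the IRIX param.h quoted above. */
#define EXAMPLE_HOSTLEN 256

/*
 * Copy `src` into `dst` (a buffer of `dstlen` bytes), always leaving
 * `dst` NUL-terminated. Returns 0 on success, -1 if `src` did not fit.
 */
int
copy_hostname(char *dst, size_t dstlen, const char *src)
{
	int n = snprintf(dst, dstlen, "%s", src);

	return (n < 0 || (size_t)n >= dstlen) ? -1 : 0;
}
```

A caller would typically pass sizeof(buf) for dstlen, so the limit follows the buffer rather than a separately maintained constant.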
If you aren't already, you may want to try a CVS snapshot to see if the problem has already been fixed there. Otherwise you will have to try malloc debugging - the crash is definitely occurring inside malloc.
dmalloc (http://dmalloc.com/) claims to work on IRIX. It's likely to increase the CPU and memory load, though. I've built with dmalloc on Linux thusly:
LDFLAGS=-ldmalloc ./configure && make
eval `dmalloc -l /path/to/log high`
./sshd [options]
I believe I built openssh with -ldmalloc. I ran the command you suggested, but nothing is being logged to that log file. I've never used dmalloc before, so I'm not sure what I'm doing. Do you have anything special that needs to be set up in your .dmallocrc?
No, I don't have a .dmallocrc. Here is exactly what I did:
# eval `dmalloc -l /tmp/dmalloc.log high`
# ./sshd -d -p 2022
[debugging output snipped]
# ls -l /tmp/dmalloc.log
-rw-r--r-- 1 dtucker dtucker 39314 Jul 14 22:13 /tmp/dmalloc.log
There might be a couple of logging problems: there will be a log per connection plus one for the master daemon (the dmalloc docs say you can use "%d" for a pid but that didn't work for me), and the logging will be done partly as the user, not root. Even without the logging, dmalloc will be useful as it should abort the ssh session (with a core dump) as soon as it detects a problem, rather than some time later.
Well, I got a core dump and ran dbx against it, and it doesn't look any different to me. The corefile size was significantly larger (28MB vs 3MB for the other), so I assume dmalloc was working.
> 0 realfree(0x10152388, 0x101520e8, 0x0, 0x6e686463, 0x6e686460, 0x1, 0x10166968, 0x0) ["/xlv86/patches/5015/work/irix/lib/libc/libc_n32_M4/gen/malloc.c":538, 0xfb24694]
1 cleanfree(0x0, 0x101520e8, 0x0, 0x6e686463, 0x6e686460, 0x1, 0x10166968, 0x0) ["/xlv86/patches/5015/work/irix/lib/libc/libc_n32_M4/gen/malloc.c":944, 0xfb24eac]
2 __malloc(0x260, 0x101520e8, 0x0, 0x6e686463, 0x6e686460, 0x1, 0x10166968, 0x0) ["/xlv86/patches/5015/work/irix/lib/libc/libc_n32_M4/gen/malloc.c":230, 0xfb240e0]
3 _malloc(0x0, 0x101520e8, 0x0, 0x6e686463, 0x6e686460, 0x1, 0x10166968, 0x0) ["/xlv86/patches/5015/work/irix/lib/libc/libc_n32_M4/gen/malloc.c":186, 0xfb23f4c]
4 xmalloc(size = 608) ["/usr/local/src/security/openssh-3.6.1p1/xmalloc.c":28, 0x10065934]
5 login_alloc_entry(pid = 77144314, username = 0x101520e8 = "ktaylor", hostname = 0x10152368 = "66-44-105-44.s806.apx2.lnhd`.md.dialup.rcn.com", line = 0x1014a20c = "/dev/ttyq51") ["/usr/local/src/security/openssh-3.6.1p1/loginrec.c":325, 0x10048aa0]
6 record_login(pid = 77144314, ttyname = 0x1014a20c = "/dev/ttyq51", user = 0x101520e8 = "ktaylor", uid = ####, host = 0x10152368 = "66-44-105-44.s806.apx2.lnhd`.md.dialup.rcn.com", addr = 0x7ffed410, addrlen = 16) ["/usr/local/src/security/openssh-3.6.1p1/sshlogin.c":72, 0x1002be58]
7 mm_record_login(s = 0x1014a1d8, pw = 0x1015dbc8) ["/usr/local/src/security/openssh-3.6.1p1/monitor.c":1030, 0x10042c24]
8 mm_answer_pty(socket = 7, m = 0x7ffed500) ["/usr/local/src/security/openssh-3.6.1p1/monitor.c":1080, 0x10042ecc]
9 monitor_read(pmonitor = 0x10152780, ent = 0x10137730, pent = (nil)) ["/usr/local/src/security/openssh-3.6.1p1/monitor.c":371, 0x10040ef4]
10 monitor_child_postauth(pmonitor = 0x10152780) ["/usr/local/src/security/openssh-3.6.1p1/monitor.c":334, 0x10040d4c]
11 privsep_postauth(authctxt = 0x10151570) ["/usr/local/src/security/openssh-3.6.1p1/sshd.c":665, 0x10025f18]
12 main(ac = 3, av = 0x7ffedf04) ["/usr/local/src/security/openssh-3.6.1p1/sshd.c":1533, 0x10028a28]
13 __start() ["/xlv55/kudzu-apr12/work/irix/lib/libc/libc_n32_M4/csu/crt1text.s":177, 0x100249e8]
Some googling shows that openssh is not the only thing with these symptoms:
http://opendx.npaci.edu/mail/opendx-dev/2000.01/msg00068.html
http://mail.python.org/pipermail/python-bugs-list/2003-April/017283.html
The second is the same release of IRIX and hints at a cause in netdb.h relating to getaddrinfo/getnameinfo. Try commenting out "#define HAVE_GETADDRINFO 1" in config.h and recompiling. What signal is it dying with, SIGSEGV or SIGBUS? Is your resolver configured for pure DNS/hosts or NIS? Can SGI help at all? 10 bucks says the IRIX getaddrinfo is trashing memory.
I thought we tried rebuilding with the openssh getaddrinfo...but our config.h says otherwise. I'll give it another go. It is dying with SIGSEGV. We're using hosts/dns, no NIS here. I can try SGI, but just like everyone else, as soon as you mention some third party software they don't want to hear it.
Ok. I commented out the getaddrinfo line in config.h and reran the make (the binary is a little larger), and still got a core dump, which looks about the same:
> 0 realfree(0x10166d70, 0x101523b0, 0x0, 0x6e686463, 0x6e686460, 0x1, 0x10166d80, 0x0) ["/xlv86/patches/5015/work/irix/lib/libc/libc_n32_M4/gen/malloc.c":538, 0xfb24694]
1 cleanfree(0x0, 0x101523b0, 0x0, 0x6e686463, 0x6e686460, 0x1, 0x10166d80, 0x0) ["/xlv86/patches/5015/work/irix/lib/libc/libc_n32_M4/gen/malloc.c":944, 0xfb24eac]
2 __malloc(0x260, 0x101523b0, 0x0, 0x6e686463, 0x6e686460, 0x1, 0x10166d80, 0x0) ["/xlv86/patches/5015/work/irix/lib/libc/libc_n32_M4/gen/malloc.c":230, 0xfb240e0]
3 _malloc(0x0, 0x101523b0, 0x0, 0x6e686463, 0x6e686460, 0x1, 0x10166d80, 0x0) ["/xlv86/patches/5015/work/irix/lib/libc/libc_n32_M4/gen/malloc.c":186, 0xfb23f4c]
4 xmalloc(size = 608) ["/usr/local/src/security/openssh-3.6.1p1/xmalloc.c":28, 0x10065994]
5 login_alloc_entry(pid = 78265682, username = 0x101523b0 = "ktaylor", hostname = 0x10166d50 = "66-44-42-202.s710.apx1.lnhd`.md.dialup.rcn.com", line = 0x1014a25c = "/dev/ttyq70") ["/usr/local/src/security/openssh-3.6.1p1/loginrec.c":325, 0x10048b00]
6 record_login(pid = 78265682, ttyname = 0x1014a25c = "/dev/ttyq70", user = 0x101523b0 = "ktaylor", uid = ####, host = 0x10166d50 = "66-44-42-202.s710.apx1.lnhd`.md.dialup.rcn.com", addr = 0x7ffed410, addrlen = 16) ["/usr/local/src/security/openssh-3.6.1p1/sshlogin.c":72, 0x1002beb8]
7 mm_record_login(s = 0x1014a228, pw = 0x1015d960) ["/usr/local/src/security/openssh-3.6.1p1/monitor.c":1030, 0x10042c84]
8 mm_answer_pty(socket = 7, m = 0x7ffed500) ["/usr/local/src/security/openssh-3.6.1p1/monitor.c":1080, 0x10042f2c]
9 monitor_read(pmonitor = 0x101527a0, ent = 0x10137770, pent = (nil)) ["/usr/local/src/security/openssh-3.6.1p1/monitor.c":371, 0x10040f54]
10 monitor_child_postauth(pmonitor = 0x101527a0) ["/usr/local/src/security/openssh-3.6.1p1/monitor.c":334, 0x10040dac]
11 privsep_postauth(authctxt = 0x10151590) ["/usr/local/src/security/openssh-3.6.1p1/sshd.c":665, 0x10025f78]
12 main(ac = 3, av = 0x7ffedf04) ["/usr/local/src/security/openssh-3.6.1p1/sshd.c":1533, 0x10028a88]
13 __start() ["/xlv55/kudzu-apr12/work/irix/lib/libc/libc_n32_M4/csu/crt1text.s":177, 0x10024a48]
If I can find the time, I've got something I may try. http://freeware.sgi.com has openssh-3.5p1 available. I don't know if the problem existed at that time; if it did, I can download the SGI release of it and see what the source code differences are, in case that points to something. Of course, if 3.5p1 direct from openssh works fine, then we're still stuck.
I've been trying several different things today. I set this line in config.h:
#define BROKEN_GETADDRINFO
(in addition to the other GETADDRINFO line in config.h) and I get this error:
"../openbsd-compat/fake-getaddrinfo.h", line 40: error(1143): declaration is incompatible with "const char *gai_strerror(int)" (declared at line 147 of "/usr/include/netdb.h")
char *gai_strerror(int ecode);
^
Ok, I don't really want to jinx anything by posting this, but I think it's starting to work now. After I set the #define BROKEN_GETADDRINFO line, I got the error message listed in the last message. To get it to compile, I commented out the offending line in /usr/include/netdb.h, and that made it happy. At this point it does look like a problem with SGI's getaddrinfo, BUT getting the fake-getaddrinfo to build into ssh requires at least the BROKEN_GETADDRINFO define to be set, and if someone can offer a clean way to get the openbsd-compat stuff to build (rather than editing the system headers), that would be a good solution. I hope that this new binary is really working and my hopes haven't been prematurely raised. :)
I found that making these changes gets the stuff built without modifying any system includes:
openbsd-compat/fake-getaddrinfo.h
40c40
< char *gai_strerror(int ecode);
---
> const char *gai_strerror(int ecode);
openbsd-compat/fake-getaddrinfo.c
18c18
< char *gai_strerror(int ecode)
---
> const char *gai_strerror(int ecode)
What's the status of this? At the moment, my understanding is:
* a bug exists in getaddrinfo in IRIX 6.5.18 and up
* defining BROKEN_GETADDRINFO causes a type clash with gai_strerror
* solving the type clash results in an sshd that works OK
Should we be defining BROKEN_GETADDRINFO for some IRIXes? If so, which versions, and is there a clean way to solve the type clash?
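One possible shape for a clean fix to the type clash is to let a build-time test choose the return type, so the replacement declaration agrees with whatever the system header declares. The sketch below is an illustration only: the macro EXAMPLE_CONST_GAI_STRERROR, the typedef, the error codes and the function name are all hypothetical stand-ins (prefixed to avoid colliding with the real netdb.h), not the names used in the OpenSSH tree.

```c
#include <stddef.h>

/*
 * Hypothetical switch standing in for a configure-time result:
 * 1 if the system header declares gai_strerror() as returning
 * "const char *", 0 if it returns plain "char *".
 */
#define EXAMPLE_CONST_GAI_STRERROR 1

#if EXAMPLE_CONST_GAI_STRERROR
typedef const char *example_gai_ret;
#else
typedef char *example_gai_ret;
#endif

/* Hypothetical error codes for the illustration. */
#define EXAMPLE_EAI_NODATA 1
#define EXAMPLE_EAI_MEMORY 2

/* Replacement error-string function whose return type follows the switch. */
example_gai_ret
example_gai_strerror(int ecode)
{
	switch (ecode) {
	case EXAMPLE_EAI_NODATA:
		return "no address associated with name";
	case EXAMPLE_EAI_MEMORY:
		return "memory allocation failure";
	default:
		return "unknown error";
	}
}
```

Because only the typedef changes, the same declaration and definition compile against either flavour of system header without editing files under /usr/include.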
Unfortunately at this time I can't confirm that the problem has gone away in the recent version of IRIX (6.5.21), maybe in a few months we'll have an updated machine we can try it on.
The "const char *gai_strerror" thing is now handled:
20030923
[snip]
- (dtucker) [configure.ac openbsd-compat/fake-rfc2553.c openbsd-compat/fake-rfc2553.h] Bug #659: Test for and handle systems with where gai_strerror is defined as "const char *". Part of patch supplied by bugzilla-openssh at thewrittenword.com
Since it appears that the root cause is a bug in an OS library, I'm closing this bug. Please re-open if you can identify a fault in OpenSSH.
Mass change of RESOLVED bugs to CLOSED