We are using OpenSSH sshd built with the tcp_wrappers library, with rules set to deny access not coming from our local domain. Recently we have seen cases where an sshd process was left running and consuming a large amount of CPU. Looking at the logs and the time the process was started, it appears that the trigger was an ssh connection denied by tcp_wrappers. (I suspect this was the password-guessing attack that's been going around recently, because we've seen few blocked ssh connections in the past, but I can't say for sure.)

This was on Solaris 8, openssh-3.9p1, OpenSSL 0.9.7d, tcp_wrappers 7.6.

uname -a: SunOS XXXXXX 5.8 Generic_108528-18 sun4u sparc SUNW,Sun-Fire-280R
Created attachment 737 [details] This is the shell script used to configure this build of openssh
The code that drops the connection is pretty simple and there's no obvious way for it to get into a loop:

	if (!hosts_access(&req)) {
		debug("Connection refused by tcp wrapper");
		refuse(&req);
		/* NOTREACHED */
		fatal("libwrap refuse returns");
	}

When it happens, can you run /usr/ucb/ps auxwww and pick out the pid of the errant process? It should have a few hints about what stage the process is at in the process title.

Also, can you reproduce it with sshd in debug mode (eg /path/to/sshd -ddde)? If so, please attach (note: use "Create New Attachment") the debug log to this bug.
Also worth trying: patch #772 in bug #973
I think attachment #773 [details] should solve this (it fixed 973). Could you please test it and report?
I'm now pretty sure this is a dupe of bug #973. Please reopen this bug if the patch on bug #973 does not resolve this problem. *** This bug has been marked as a duplicate of 973 ***
I tested with the patch but the sshd process still consumes a large amount of CPU. This happens on HP-UX 11.11, and it happens regardless of whether sshd is run with the -r option.
OK I'll see if I can reproduce it.
Could you please provide the debugging requested in comment #2? Specifically:

When it happens, can you run /usr/ucb/ps auxwww and pick out the pid of the errant process? It should have a few hints about what stage the process is at in the process title.

Also, can you reproduce it with sshd in debug mode (eg /path/to/sshd -ddde)? If so, please attach (note: use "Create New Attachment") the debug log to this bug.
Created attachment 788 [details] ps output for sshd taking high end of CPU

I'm not able to reproduce the problem in debug mode. Here are a few things that I found on HP-UX. The listening sshd goes to high CPU after the tcp_wrappers denied message; I checked it with the top command. On HP-UX 11.11 this happens after the 3rd blocked connection; on HP-UX 11.23 it happens after the 5th.
Have not been able to reproduce on either Solaris or HP-UX. Marking bug as all-platform.

Senthil: a couple of questions:
- what options did you build OpenSSH with?
- what do the deny rules generally look like (eg deny by IP or DNS name)?
>what options did you build OpenSSH with?

./configure --prefix --with-pam --with-ssl-dir --with-tcp-wrappers --with-zlib --with-kerberos5 --with-rand-helper --with-md5-passwords --with-4in6

>what do the deny rules generally look like (eg deny by IP or DNS name)?

deny by IP.
Still can't reproduce it. Another thing to try: do a system call trace of the CPU-eating sshd and see what it's doing. That's truss on Solaris and tusc on HP-UX. (tusc is an unsupported tool supplied by HP: ftp://ftp.cup.hp.com/dist/networking/tools/)
(In reply to comment #12)
> Still can't reproduce it. Another thing to try: do a system call trace of the
> CPU-eating sshd and see what it's doing.
>
> That's truss on Solaris and tusc on HP-UX. (tusc is an unsupported tool
> supplied by HP: ftp://ftp.cup.hp.com/dist/networking/tools/)

I haven't found a way to reproduce the bug at will, though I suspect some aggressive attack script may be involved, because of the timing of when I started seeing the problems. The patch doesn't seem to help me on Solaris 8, either.

I was using CVS snapshots of OpenSSL and OpenSSH:
openssl-0.9.7-stable-SNAP-20050110
openssh-SNAP-20050204

with this script to run configure:

- - -
#!/usr/bin/csh -vx
rm -f config.cache
set path=(/usr/local/openssl-r/bin $path)
setenv CC "/opt/SUNWspro/bin/cc"
#setenv CFLAGS "-I/usr/local/openssl-r/include -I/usr/local/zlib-1.2.1/include -I/usr/local/include"
#setenv LFLAGS "-L/usr/local/openssl-r -L/usr/local/zlib-1.2.1/lib -L/usr/local/lib"
setenv CFLAGS "-I/usr/local/openssl-r/include -I/usr/local/include"
setenv LFLAGS "-L/usr/local/openssl-r -L/usr/local/lib"
nohup sh ./configure --prefix=/usr/local/openssh-r \
	--sysconfdir=/etc/openssh \
	--with-ssl-dir=/usr/local/openssl-r/lib \
	--with-prngd-socket=/local-adm-pub/prngd/egd-pool \
	--with-tcp-wrappers \
	--with-4in6 \
	--with-pam \
#	--with-zlib=/usr/local/zlib-1.2.1/lib \
	--with-xauth=/usr/openwin/bin/xauth \
	--with-default-path="/usr/bin:/usr/ucb:/usr/openwin/bin:/usr/local/bin" \
	>& my.configure.out.$$ &
#
echo my.configure.out.$$
#
ps -f
#
- - -
Created attachment 822 [details] truss output on high CPU process

I tried various truss options, looking for something more verbose/informative. This is a sample of the output of:

truss -f -rall -vall -D -p 4183 |& more

It seems to show only calls to read, a few of which take a lot of time and most of which don't take much.
Created attachment 823 [details] ps -ef output for high CPU process used in truss This is the output of ps -ef on the failing processes in the case I did the truss on.
Created attachment 824 [details] corresponding syslog messages from ssh, tcp_wrappers, prngd

These are the messages in syslog from about the time that the ps output seems to imply the high-CPU processes started. I ran egrep '20:01:3' on the log file. This file gets messages from prngd, sshd, and tcp_wrappers. I note there's an error from prngd (which I'm using as a random number source); could that be a factor in the problem? (I'm using prngd because this server was an HP-UX box in a previous life.)

The tcp_wrappers rules are first a number of exception rules of the form:

sshd,in.ftpd: SOME_ADDRESS : rfc931 15 : keepalive : nice 1 : allow

where SOME_ADDRESS is an IP address, a DNS host name, or a domain suffix (.foo.example.com) for which we want to allow traffic. There's a generic allow rule for on-campus traffic:

sshd: .ourdomain.edu : nice 1 : allow

Anything else falls through to a default deny rule:

ALL: ALL : deny

Could the use of rfc931 lookups trigger problems?
Created attachment 825 [details] corresponding output of /usr/ucb/ps -auxwww This is the output of /usr/ucb/ps -auxwww, around the time the problem started, as captured by a cron job running at 15 minute intervals.
(In reply to comment #14)
> It seems to show only calls to read, a few of which take a lot of time and
> most of which don't take much.

This read loop is probably the cause of the CPU utilization. It would appear that something, somewhere keeps retrying a read after a descriptor closes.

A couple of suggestions for getting more info:

Can you use lsof or similar to find out what the descriptor is? (It's 3 in this case, but there's no guarantee it'll be the same, so check with truss first.)

Attach a debugger to the looping process and get a backtrace to find out where it's stuck. With gdb this is something like:

# gdb /path/to/sshd pid
(gdb) backtrace

If you use the sshd from before it's stripped (ie from the build dir) then you will get better info. You don't have to be running the one with the debug symbols: you can be running the stripped sshd and use the one with the debug symbols in gdb, as long as the two sshds match.
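To illustrate the suspected failure mode, here is a minimal sketch (hypothetical code, not the actual sshd or OpenSSL source) of a read-retry loop that never checks for end-of-file. Once the peer closes the descriptor, read() returns 0 on every call, so the loop never blocks and never makes progress:

/* Hypothetical sketch of the suspected bug pattern: a loop that wants
 * "count" bytes but treats read() == 0 (EOF) as "no data yet, retry". */
#include <errno.h>
#include <unistd.h>

static int read_exact_buggy(int fd, unsigned char *buf, size_t count)
{
	size_t got = 0;

	while (got < count) {
		ssize_t n = read(fd, buf + got, count - got);

		if (n > 0)
			got += n;
		else if (n < 0 && errno != EINTR && errno != EAGAIN)
			return -1;
		/* BUG: n == 0 (descriptor closed) falls through and read()
		 * is called again immediately, burning CPU forever.
		 * The fix is to treat n == 0 as EOF and bail out. */
	}
	return 0;
}

In a truss of such a process you would expect to see exactly what was attached: an unbroken stream of read() calls returning 0.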
I was able to reproduce the problem by creating a tcp_wrappers deny rule for a local box, running a test server on a high port, and running "ab" (the Apache benchmark utility) against that port. In fact it was even easier to mess it up by running "ab" against the port without the tcp_wrappers deny rule.

This binary was built with the Sun compiler suite, not gcc, so I used the corresponding "dbx" debugger traceback feature. It looks like it's looping deep in the random number generation code; several traces all looked similar.

I suspect the common factor between me and the other case is use of prngd or a similar daemon, rather than the Sun /dev/random patch. This set-up was brought over from HP-UX, which didn't have a /dev/random patch, and the other report is from HP-UX. Plus, there were syslog errors from prngd.
Created attachment 830 [details] ps output and dbx traceback

This is the result of running

ab -n 100 -t 30 -c 50 http://HOSTNAME:4022/

on the test server port from a blocked host. Afterwards I did a ps and a traceback with dbx on one of the runaway processes.
Created attachment 831 [details] check for closed fd in openssl's RAND_query_egd_bytes

This appears to be a bug in OpenSSL's prngd interface. I think what is happening is that prngd is closing sockets (the "fairness" limits? if so, does the problem occur with prngd-0.9.27?) and OpenSSL does not handle the descriptor closing (read() will return zero), so it repeatedly retries the read.

Please try this patch to OpenSSL (against 0.9.7e, but it may apply to others), rebuild it, then rebuild OpenSSH with the new OpenSSL.
Created attachment 832 [details] a better fix for closed fd in openssl's RAND_query_egd_bytes
Created attachment 833 [details] hack prngd to provoke failure at will

This is a nasty hack to prngd in case anyone wants to reproduce this: apply it to prngd-0.9.29, build, run "./prngd /var/run/egd-pool", then run "openssl rand -out /dev/null 512" and watch the CPU burn... I have submitted this and the fix upstream to OpenSSL's request tracker (via email, no request number yet).
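For anyone who wants to confirm independently that the daemon is closing the connection mid-request, here is a rough sketch of a standalone test client (hypothetical code, not part of OpenSSH or prngd; the socket path and byte count are just examples). It assumes the usual EGD protocol, where command byte 0x02 followed by a count byte requests a blocking read of that many entropy bytes. If prngd drops the connection, the read() returns 0 instead of data:

/* Hypothetical test client for an EGD/prngd socket.
 * Usage: ./egdtest /var/run/egd-pool   (path is an example) */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct sockaddr_un sa;
	unsigned char cmd[2] = { 0x02, 64 };	/* blocking read, 64 bytes */
	unsigned char buf[64];
	ssize_t n;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s /path/to/egd-pool\n", argv[0]);
		return 1;
	}
	memset(&sa, 0, sizeof(sa));
	sa.sun_family = AF_UNIX;
	strncpy(sa.sun_path, argv[1], sizeof(sa.sun_path) - 1);

	fd = socket(AF_UNIX, SOCK_STREAM, 0);
	if (fd < 0 || connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
		perror("connect");
		return 1;
	}
	if (write(fd, cmd, sizeof(cmd)) != sizeof(cmd)) {
		perror("write");
		return 1;
	}
	n = read(fd, buf, sizeof(buf));
	if (n == 0)
		printf("daemon closed the connection (EOF)\n");
	else if (n < 0)
		perror("read");
	else
		printf("got %zd entropy bytes\n", (ssize_t)n);
	close(fd);
	return 0;
}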
Comment on attachment 832 [details] a better fix for closed fd in openssl's RAND_query_egd_bytes

>	num = read(fd, egdbuf, 1);
>-	if (num >= 0)
>+	if (num == 0)
>+		goto err;	/* descriptor closed */
>+	else if (num > 0)

What about the (num == -1) case? Does it cope with EINTR and EAGAIN?
(In reply to comment #24)
> What about the (num == -1) case? Does it cope with EINTR and EAGAIN?

Yeah, that's already handled a bit further down, just out of reach of the context. The entire block of code (compacted to save space) is:

if (num == 0)
	goto err;			/* descriptor closed */
else if (num > 0)
	numbytes += num;
else {
	switch (errno) {
#ifdef EINTR
	case EINTR:
#endif
#ifdef EAGAIN
	case EAGAIN:
#endif
		/* No error, try again */
		break;
	default:
		ret = -1;
		goto err;		/* failure */
	}
The OpenSSL folks seem to have accepted the bug and applied the patch: http://www.aet.tu-cottbus.de/rt2/Ticket/Display.html?id=1014 If the patch fixes your problem, please close this bug as it's not a problem with OpenSSH.
With patch 832 applied to OpenSSL, test loads that previously produced several runaway processes produced no extra lingering processes. (This was based on the same CVS snapshots of OpenSSL and OpenSSH source used in my truss and dbx debugging tests, which include the previously suggested patch from bug 973.)
Changed all RESOLVED bugs to CLOSED with the exception of the ones fixed post-4.4.