Bug 948 - high CPU in sshd after tcp_wrappers deny
Summary: high CPU in sshd after tcp_wrappers deny
Status: CLOSED FIXED
Alias: None
Product: Portable OpenSSH
Classification: Unclassified
Component: sshd (show other bugs)
Version: 3.9p1
Hardware: All All
: P2 normal
Assignee: OpenSSH Bugzilla mailing list
URL: http://www.aet.tu-cottbus.de/rt2/Tick...
Keywords:
Depends on:
Blocks:
 
Reported: 2004-11-01 06:59 AEDT by Albert Lunde
Modified: 2006-10-07 11:37 AEST (History)
1 user (show)

See Also:


Attachments
This is the shell script used to configure this build of openssh (930 bytes, text/plain)
2004-11-01 07:05 AEDT, Albert Lunde
no flags Details
ps output for sshd taking high end of CPU (732 bytes, text/plain)
2005-01-25 23:39 AEDT, senthilkumar
no flags Details
trust output on high CPU process (61.68 KB, text/plain)
2005-02-15 04:51 AEDT, Albert Lunde
no flags Details
ps -ef output for high CPU process used in truss (582 bytes, text/plain)
2005-02-15 05:11 AEDT, Albert Lunde
no flags Details
corresponding syslog messages from ssh, tcp_wrappers, prngd (17.76 KB, text/plain)
2005-02-15 05:30 AEDT, Albert Lunde
no flags Details
corresponding output of /usr/ucb/ps -auxwww (1.67 KB, text/plain)
2005-02-15 05:40 AEDT, Albert Lunde
no flags Details
ps output and dbx traceback (3.31 KB, text/plain)
2005-02-19 11:38 AEDT, Albert Lunde
no flags Details
check for closed fd in openssl's RAND_query_egd_bytes (538 bytes, patch)
2005-02-19 13:55 AEDT, Darren Tucker
no flags Details | Diff
a better fix for closed fd in openssl's RAND_query_egd_bytes (688 bytes, patch)
2005-02-19 18:41 AEDT, Darren Tucker
no flags Details | Diff
hack prngd to provoke failure at will (392 bytes, patch)
2005-02-19 18:57 AEDT, Darren Tucker
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Albert Lunde 2004-11-01 06:59:17 AEDT
We are using OpenSSH sshd built with the tcp_wrappers library, and rules set to
deny access not coming from our local domain.

Recently we have seen cases where an sshd process was left running and consuming
a large amount of CPU. Looking at the logs and the time the process was started,
it appears that the trigger was a denied ssh connection blocked by tcp_wrappers. 

(I suspect this was the password guessing attack that's been going around
recently, because we've gotten few blocked ssh connections in the past, but I
can't say for sure.)

This was on Solaris 8, openssh-3.9p1, OpenSSL 0.9.7d, tcp_wrappers 7.6

uname -a
SunOS XXXXXX 5.8 Generic_108528-18 sun4u sparc SUNW,Sun-Fire-280R
Comment 1 Albert Lunde 2004-11-01 07:05:21 AEDT
Created attachment 737 [details]
This is the shell script used to configure this build of openssh
Comment 2 Darren Tucker 2004-11-02 22:01:30 AEDT
The code that drops the connection is pretty simple and there's no obvious way
for it to get into a loop:

if (!hosts_access(&req)) {
      debug("Connection refused by tcp wrapper");
      refuse(&req);
      /* NOTREACHED */
      fatal("libwrap refuse returns");
}

When it happens, can you run /usr/ucb/ps auxwww and pick out the pid of the
errant process?  It should have a few hints about what stage the process is at
in the process title.

Also, can you reproduce it with sshd in debug mode (eg /path/to/sshd -ddde)?  If
so, please attach (note: use "Create New Attachment") the debug log to this bug.
Comment 3 Darren Tucker 2005-01-19 20:01:15 AEDT
Also worth trying: patch #772 in bug #973
Comment 4 Darren Tucker 2005-01-20 22:28:42 AEDT
I think attachment #773 [details] should solve this (it fixed 973).  Could you please test
it and report?
Comment 5 Darren Tucker 2005-01-24 22:15:11 AEDT
I'm now pretty sure this is a dupe of bug #973.  Please reopen this bug if the
patch on bug #973 does not resolve this problem.

*** This bug has been marked as a duplicate of 973 ***
Comment 6 senthilkumar 2005-01-25 17:16:00 AEDT
I tested with the patch but still the sshd process goes to the high end of CPU 
cycles. This happens in hpux 11.11. This also happens irrespective of -r option 
for sshd.
Comment 7 Darren Tucker 2005-01-25 17:43:08 AEDT
OK I'll see if I can reproduce it.
Comment 8 Darren Tucker 2005-01-25 19:51:23 AEDT
Could you please provide the debugging requested in comment #2?  Specifically:

When it happens, can you run /usr/ucb/ps auxwww and pick out the pid of the
errant process?  It should have a few hints about what stage the process is at
in the process title.

Also, can you reproduce it with sshd in debug mode (eg /path/to/sshd -ddde)?  If
so, please attach (note: use "Create New Attachment") the debug log to this bug.
Comment 9 senthilkumar 2005-01-25 23:39:57 AEDT
Created attachment 788 [details]
ps output for sshd taking high end of CPU

Im not able to reproduce the problem in debug mode. Here are few things that I
found in hpux. The listening sshd goes to high end after the denied message of
tcp_wrappers. I checked it with top command. Particularly, in hpux 11.11 this
happens after 3rd blocked connection. However, in hpux 11.23 this happens after
5th connection.
Comment 10 Darren Tucker 2005-01-26 00:36:15 AEDT
Have not been able to reproduce on either Solaris or HP-UX.  Marking bug as
all-platform.

Senthil: a couple of questions:
 - what options did you build OpenSSH with?
 - what do the deny rules generally look like (eg deny by IP or DNS name)?
Comment 11 senthilkumar 2005-01-26 01:50:46 AEDT
>what options did you build OpenSSH with?
     ./configure --prefix --with-pam --with-ssl-dir --with-tcp-wrappers 
--with-zlib --with-kerberos5 --with-rand-helper --with-md5-passwords --with-4in6

>what do the deny rules generally look like (eg deny by IP or DNS name)?
     deny by IP.
      
 
Comment 12 Darren Tucker 2005-01-26 23:33:53 AEDT
Still can't reproduce it.  Another thing to try: doing a system call trace the
cpu-eating sshd and see what it's doing.

That's truss on Solaris and tusc on HP-UX.  (tusc is an unsupported tool
supplied by HP: ftp://ftp.cup.hp.com/dist/networking/tools/)
Comment 13 Albert Lunde 2005-02-15 04:40:21 AEDT
(In reply to comment #12)
> Still can't reproduce it.  Another thing to try: doing a system call trace the
> cpu-eating sshd and see what it's doing.
> 
> That's truss on Solaris and tusc on HP-UX.  (tusc is an unsupported tool
> supplied by HP: ftp://ftp.cup.hp.com/dist/networking/tools/)

I haven't found a way to reproduce the bug at will, though I suspect some
aggressive attack script may be involved, because of the timing of when I
started seeing the problems.

The patch doesn't seem to help me, on Solaris 8, either.

I was using CVS snapshorts of OpenSSL and OpenSSH:

openssl-0.9.7-stable-SNAP-20050110
openssh-SNAP-20050204

with this script to run configure:
- - -
#!/usr/bin/csh -vx
rm -f config.cache
set path=(/usr/local/openssl-r/bin $path)
setenv CC "/opt/SUNWspro/bin/cc"
#setenv CFLAGS "-I/usr/local/openssl-r/include -I/usr/local/zlib-1.2.1/include
-I/usr/local/include"
#setenv LFLAGS "-L/usr/local/openssl-r -L/usr/local/zlib-1.2.1/lib -L/usr/local/lib"
setenv CFLAGS "-I/usr/local/openssl-r/include  -I/usr/local/include"
setenv LFLAGS "-L/usr/local/openssl-r -L/usr/local/lib"
nohup sh ./configure  --prefix=/usr/local/openssh-r \
        --sysconfdir=/etc/openssh \
        --with-ssl-dir=/usr/local/openssl-r/lib \
        --with-prngd-socket=/local-adm-pub/prngd/egd-pool       \
        --with-tcp-wrappers \
        --with-4in6 \
                --with-pam      \
#               --with-zlib=/usr/local/zlib-1.2.1/lib   \
        --with-xauth=/usr/openwin/bin/xauth             \
        --with-default-path="/usr/bin:/usr/ucb:/usr/openwin/bin:/usr/local/bin" \
   >& my.configure.out.$$ &

#
echo my.configure.out.$$
#
ps -f
#
- - -
Comment 14 Albert Lunde 2005-02-15 04:51:52 AEDT
Created attachment 822 [details]
trust output on high CPU process

I tried various truss options, seeing something more verbose/informative.

This is a sample of the output of 
truss -f -rall -vall -D -p 4183 | & more

It seems to only be showing calls to read, a few of which take a lot of time,
and most of which don't take much.
Comment 15 Albert Lunde 2005-02-15 05:11:22 AEDT
Created attachment 823 [details]
ps -ef output for high CPU process used in truss

This is the output of ps -ef on the failing processes in the case I did the
truss on.
Comment 16 Albert Lunde 2005-02-15 05:30:46 AEDT
Created attachment 824 [details]
corresponding syslog messages from ssh, tcp_wrappers, prngd

These are the messages in syslog from about the time that the ps output seems
to imply the high-cpu processes started. I ran egrep '20:01:3' on the log file.


This file gets messages from prngd, sshd, and tcp_wrappers. I note there's an
error from prngd (which I'm using as a random number source), could that be a
factor in the problem? (I'm using prngd, because this server was a HP-UX box in
a previous life..)

The tcp_wrappers rules are first a number of exception rules of the form:

sshd,in.ftpd: SOME_ADDRESS  : rfc931 15 : keepalive : nice 1 : allow

Where SOME_ADDRESS is an IP address, a DNS host address, or a domain
suffix(.foo.example.com) for which we want to allow traffic:

There's a generic allow rule for on-campus traffic:

sshd: .ourdomain.edu : nice 1 : allow

anything else falls thru to a default deny rule:

ALL: ALL : deny

Could the use of rfc931 lookups trigger problems?
Comment 17 Albert Lunde 2005-02-15 05:40:45 AEDT
Created attachment 825 [details]
corresponding output of /usr/ucb/ps -auxwww 

This is the output of /usr/ucb/ps -auxwww, around the time the problem started,
as captured by a cron job running at 15 minute intervals.
Comment 18 Darren Tucker 2005-02-15 09:27:57 AEDT
(In reply to comment #14)
> It seems to only be showing calls to read, a few of which take a lot of time,
> and most of which don't take much.

This read loop is probably the cause of the CPU utilization.  It would appears
that something, somewhere keeps retrying a read after a descriptor closes.

A couple of suggestions for more getting more info:
Can you use lsof or similar to find out what the descriptor is?  (It's 3 in this
case but there's no guarantee it'll be the same, so check with truss first).

Attach a debugger to the looping process and get a backtrace to find out where
it's stuck.  With gdb this is something like:
# gdb /path/to/sshd pid
(gdb) backtrace

If you use the sshd before it's stripped (ie from the build dir) then you will
get better info.  You don't have to be running the one with the debug symbols,
you can be running the stripped sshd and use the one with the debug symbols with
gdb as long as the two sshd's match.
Comment 19 Albert Lunde 2005-02-19 11:34:46 AEDT
I was able to reproduce the problem, by creating a tcp_wrappers deny rule for a
local box, running a test server on a high port, and running the "ab" (apache
benchmark utility) against that port.

In fact it was even easier to mess it up by running "ab" against the port
without the tcp_wrappers deny rule.

This binary was build with the Sun compiler suite, not gcc, so I used the
corresponding "dpx" debugger traceback feature.

It looks like it's looping deep in the random number generation code; several
traces all looked similar.

I suspect the common factor between me and the other case is use of prngd or a
similar daemon, rather than the sun /dev/random patch. This set-up was brought
over from HP-UX which didn't have a /dev/random patch, and the other report is
from HP-UX. Plus, there were syslog errors from prngd.
Comment 20 Albert Lunde 2005-02-19 11:38:53 AEDT
Created attachment 830 [details]
ps output and dbx traceback

This is the results of running

ab -n 100 -t 30 -c 50 http://HOSTNAME:4022/

on the test server port from a blocked host.

Afterwards I did a ps and a traceback with dbx on one of the runaway processes.
Comment 21 Darren Tucker 2005-02-19 13:55:03 AEDT
Created attachment 831 [details]
check for closed fd in openssl's RAND_query_egd_bytes

This appears to be a bug in OpenSSL's prngd interface.

I think what is happening is prngd is closing sockets (the "fairness" limits? 
if so does the problem occur with prngd-0.9.27?) and OpenSSL does not handle
the the descriptor closing (read() will return zero) and repeatedly retries the
read.

Please try this patch to OpenSSL (against 0.9.7e but may apply to others),
rebuild then rebuild OpenSSH with the new OpenSSL.
Comment 22 Darren Tucker 2005-02-19 18:41:28 AEDT
Created attachment 832 [details]
a better fix for closed fd in openssl's RAND_query_egd_bytes
Comment 23 Darren Tucker 2005-02-19 18:57:39 AEDT
Created attachment 833 [details]
hack prngd to provoke failure at will

This is a nasty hack to prngd in case anyone wants to reproduce this: apply to
prngd-0.9.29 build and run "./prngd /var/run/egd-pool", then run "openssl rand
-out /dev/null 512" and watch the CPU burn...

I have submitted this and the fix upstream to OpenSSL's request tracker (via
email, no request number yet).
Comment 24 Damien Miller 2005-02-19 19:26:56 AEDT
Comment on attachment 832 [details]
a better fix for closed fd in openssl's RAND_query_egd_bytes

> 	        num = read(fd, egdbuf, 1);
>-	        if (num >= 0)
>+	        if (num == 0)
>+			goto err;	/* descriptor closed */
>+		else if (num > 0)

What about the (num == -1) case? Does it cope with EINTR and EAGAIN?
Comment 25 Darren Tucker 2005-02-19 19:33:18 AEDT
(In reply to comment #24)
> What about the (num == -1) case? Does it cope with EINTR and EAGAIN?

Yeah, that's already handled a bit further down, just out reach of the context.
 The entire block of code (compacted to save space) is:

                if (num == 0)
                        goto err;       /* descriptor closed */
                else if (num > 0)
                    numbytes += num;
                else {
                    switch (errno) {
#ifdef EINTR
                        case EINTR:
#endif
#ifdef EAGAIN
                        case EAGAIN:
#endif
                            /* No error, try again */
                            break;
                        default:
                            ret = -1;
                            goto err;   /* failure */
                        }
Comment 26 Darren Tucker 2005-02-19 21:28:14 AEDT
The OpenSSL folks seem to have accepted the bug and applied the patch:
http://www.aet.tu-cottbus.de/rt2/Ticket/Display.html?id=1014

If the patch fixes your problem, please close this bug as it's not a problem
with OpenSSH.

Comment 27 Albert Lunde 2005-02-26 00:54:30 AEDT
With the patch 832 to OpenSSL, test loads that previously produced several
run-away processes, produced no extra lingering processes. 

(This was based on the same CVS snapshots of OpenSSL and OpenSSH source used in
my truss and dbx debug tests, which includes the previously suggested patch from
bug 973.)
Comment 28 Darren Tucker 2006-10-07 11:37:45 AEST
Change all RESOLVED bug to CLOSED with the exception of the ones fixed post-4.4.