Bug 33 - sshd can run out of file descriptors and spin
Status: CLOSED WORKSFORME
Alias: None
Product: Portable OpenSSH
Classification: Unclassified
Component: sshd
Version: -current
Hardware: All Solaris
Importance: P2 major
Assignee: OpenSSH Bugzilla mailing list
Reported: 2001-12-06 12:03 AEDT by Tony Doan
Modified: 2004-04-14 12:24 AEST

Description Tony Doan 2001-12-06 12:03:05 AEDT
Running OpenSSH 2.9.9p2 on SPARC (32-bit) Solaris 2.6 (Ultra 1), we have found a reproducible situation where sshd can run out of file descriptors and go into a tight loop soaking up CPU cycles.

The initial symptom of this issue is users reporting generally sluggish performance on the host. They usually run top and find an sshd at the top of the list, soaking up between 50% and 75% of the CPU.

I then ran truss on the process to find out what it was working so hard on:

poll(0xEFFFD3A0, 5, -1)                         = 1
accept(8, 0xEFFFF318, 0xEFFFB314)               Err#24 EMFILE
fstat(-1, 0xEFFFA938)                           Err#9 EBADF
poll(0xEFFFD3A0, 5, -1)                         = 1
accept(8, 0xEFFFF318, 0xEFFFB314)               Err#24 EMFILE
fstat(-1, 0xEFFFA938)                           Err#9 EBADF
poll(0xEFFFD3A0, 5, -1)                         = 1
accept(8, 0xEFFFF318, 0xEFFFB314)               Err#24 EMFILE
fstat(-1, 0xEFFFA938)                           Err#9 EBADF

Err#9 is "Bad file descriptor"
Err#24 is "Too many open files"

Running out of file descriptors wasn't something I was expecting to find. Looking into it, I found that my users were quite often using ghostscript and leaving it running for long periods of time (i.e. months). It would seem that ghostscript in particular often creates and tears down windows, and each new window creates a new socket when you are tunneling X through ssh. However, it seems that it doesn't tear them down very well, and sshd ends up with a bunch of half-dead sockets whose file descriptors are still open.

To verify this I did the following as a test:
1) ssh to the remote Solaris box from my desktop
2) Find the sshd on the remote host that is running my connection (yea, ps!)
3) Run /usr/proc/bin/pfiles to find out how many open file descriptors the sshd has before any X forwarding has occurred (answer: 9)
4) Open an X app (xclock)
5) Rerun pfiles and note that the fd count has gone to 10.
6) Open another X app
7) Rerun pfiles and note that the fd count has gone to 11.
8) Close an X app
9) Rerun pfiles and note that the fd count has gone back to 10.
etc.
10) Run gs and note that the fd count has gone back up by 1.
11) Do more random work in gs: editing, previewing, etc. (which tends to open and close windows).
12) Occasionally run pfiles and note that the number of fds constantly grows and does not shrink.
13) Artificially inflate the number of fds up to 64 (default max on Solaris) using gs and more xclocks.
14) When the fd count reaches 64, watch sshd go into a tight loop and start soaking up CPU when you try to open the 65th window.
14a) Note that no new X windows (whether from a new process or an existing one) can be opened.
15) Verify with truss that sshd is running the same loop of system calls as shown above.
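
The poll/accept loop from the truss output can also be reproduced outside sshd. The following is a minimal sketch, not sshd code: the function name count_emfile_spins and the constant FD_LIMIT are illustrative, and it assumes a POSIX system where a loopback TCP listener can be created. It queues one connection, lowers the descriptor limit to 64, fills the descriptor table with dup(), and then counts how often poll() reports the listener readable while accept() fails with EMFILE.

```c
#include <errno.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/socket.h>
#include <unistd.h>

#define FD_LIMIT 64 /* illustrative: the Solaris default quoted in the report */

/*
 * Hypothetical helper. Returns how many of `rounds` poll()/accept() rounds
 * saw the listener readable while accept() failed with EMFILE, or -1 if the
 * setup could not be completed.
 */
int count_emfile_spins(int rounds)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    struct rlimit saved, lowered;
    int filler[FD_LIMIT];
    int nfill = 0, spins = 0, result = -1, i, fd;

    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    int cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0 || cfd < 0)
        goto done;

    /* Listen on an ephemeral loopback port and queue one connection. */
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(lfd, 8) < 0 ||
        getsockname(lfd, (struct sockaddr *)&addr, &len) < 0 ||
        connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto done;

    /* Lower the soft limit, then use up every remaining descriptor. */
    if (getrlimit(RLIMIT_NOFILE, &saved) < 0)
        goto done;
    lowered = saved;
    if (lowered.rlim_cur > FD_LIMIT)
        lowered.rlim_cur = FD_LIMIT;
    if (setrlimit(RLIMIT_NOFILE, &lowered) < 0)
        goto done;
    while (nfill < FD_LIMIT && (fd = dup(lfd)) >= 0)
        filler[nfill++] = fd;

    /* Same shape as the loop in the truss output: poll, then accept. */
    for (i = 0; i < rounds; i++) {
        struct pollfd pfd = { .fd = lfd, .events = POLLIN };
        if (poll(&pfd, 1, 1000) != 1)
            break;
        fd = accept(lfd, NULL, NULL);
        if (fd >= 0) {
            close(fd); /* unexpected: a descriptor was still free */
            break;
        }
        if (errno == EMFILE)
            spins++;
    }
    result = spins;

    /* Release the filler descriptors and restore the original limit. */
    while (nfill > 0)
        close(filler[--nfill]);
    setrlimit(RLIMIT_NOFILE, &saved);
done:
    if (cfd >= 0)
        close(cfd);
    if (lfd >= 0)
        close(lfd);
    return result;
}
```

If every round counts as a spin, the pending connection is never taken off the listen queue, so a poll() with an infinite timeout would return immediately each time, which would be consistent with the CPU usage described above.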

At this point we were pretty confident that we had a bug in both gs and sshd. Thinking the sshd bug more important I went to the source.

I found this snippet in sshd.c (RCSID("$OpenBSD: sshd.c,v 1.204 2001/08/23 17:59:31 camield Exp $");)

starts at line 994:
        newsock = accept(listen_socks[i], (struct sockaddr *)&from,
            &fromlen);
        if (newsock < 0) {
                if (errno != EINTR && errno != EWOULDBLOCK)
                        error("accept: %.100s", strerror(errno));
                continue;
        }
        if (fcntl(newsock, F_SETFL, 0) < 0) {
                error("newsock del O_NONBLOCK: %s", strerror(errno));
                continue;

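It would seem the result from accept is being checked but not properly handled when it is -1. All the if (newsock < 0) block does is decide whether to log something; the continue is executed either way. With EMFILE the pending connection is presumably never taken off the listen queue, so the next poll returns immediately and the loop repeats, which would match the truss output above.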

I may be totally off base on this being the right part of the code responsible for the behavior I am seeing, but I thought it might be a good place to start. I did verify that this piece of code is still the same in openssh-3.0.2p1 (RCSID("$OpenBSD: sshd.c,v 1.209 2001/11/10 13:19:45 markus Exp $");)
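
One possible way to keep a loop like this from spinning is to pause briefly when accept() fails for lack of descriptors, so that other connections get a chance to close. The following is only a sketch under that assumption and is not presented as the fix adopted by OpenSSH: the names accept_error_needs_backoff and accept_with_backoff and the backoff_ms parameter are hypothetical, and fprintf stands in for sshd's error().

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>

/* Hypothetical helper: does this accept() errno mean "out of descriptors"? */
int accept_error_needs_backoff(int err)
{
    return err == EMFILE || err == ENFILE;
}

/*
 * Hypothetical wrapper around accept(). On success it returns the new
 * socket. On failure it logs as the quoted snippet does, sleeps for
 * backoff_ms milliseconds if the process or system is out of descriptors,
 * and returns -1 with errno preserved.
 */
int accept_with_backoff(int listen_fd, long backoff_ms)
{
    int newsock = accept(listen_fd, NULL, NULL);
    if (newsock >= 0)
        return newsock;

    int saved_errno = errno;
    if (saved_errno != EINTR && saved_errno != EWOULDBLOCK)
        fprintf(stderr, "accept: %.100s\n", strerror(saved_errno));

    if (accept_error_needs_backoff(saved_errno)) {
        /* Yield the CPU instead of retrying immediately. */
        struct timespec delay;
        delay.tv_sec = backoff_ms / 1000;
        delay.tv_nsec = (backoff_ms % 1000) * 1000000L;
        nanosleep(&delay, NULL);
    }

    errno = saved_errno;
    return -1;
}
```

A caller would still continue to the next loop iteration on -1, as the quoted code does; the difference is that each failed attempt now costs a short sleep rather than a full-speed retry. This does not free any descriptors by itself, so it limits the CPU usage without addressing the leaked sockets.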

Thanks for your attention,
Tony Doan
tdoan@tdoan.com
Comment 1 Markus Friedl 2001-12-11 06:52:53 AEDT
Could you please provide the output of
	ssh -v -v -v
?
Comment 2 Kevin Steves 2002-01-06 12:05:57 AEDT
No additional info provided.
Comment 3 Damien Miller 2004-04-14 12:24:17 AEST
Mass change of RESOLVED bugs to CLOSED