Running OpenSSH 2.9.9p2 on SPARC (32-bit) Solaris 2.6 (Ultra 1) we have found a reproducible situation where sshd can be run out of file descriptors and put into a tight loop soaking up CPU cycles. The initial symptom of this issue is users reporting generally sluggish performance on the host. They usually run top and find an sshd at the top of the list consuming between 50% and 75% of the CPU. I then ran truss on the process to find out what it was working so hard on. A truss of the process reveals:

    poll(0xEFFFD3A0, 5, -1)            = 1
    accept(8, 0xEFFFF318, 0xEFFFB314)  Err#24 EMFILE
    fstat(-1, 0xEFFFA938)              Err#9  EBADF
    poll(0xEFFFD3A0, 5, -1)            = 1
    accept(8, 0xEFFFF318, 0xEFFFB314)  Err#24 EMFILE
    fstat(-1, 0xEFFFA938)              Err#9  EBADF
    poll(0xEFFFD3A0, 5, -1)            = 1
    accept(8, 0xEFFFF318, 0xEFFFB314)  Err#24 EMFILE
    fstat(-1, 0xEFFFA938)              Err#9  EBADF

Err#9 is "Bad file descriptor"; Err#24 is "Too many open files".

Running out of file descriptors wasn't something I was expecting to find. Looking into it, I found that my users quite often use Ghostscript and leave it running for long periods of time (i.e. months). It would seem that Ghostscript in particular often creates and tears down windows, each of which creates a new socket when you are tunneling X through ssh. However, it seems that it doesn't tear them down very well, and sshd ends up with a bunch of half-dead sockets whose file descriptors are still open.

To verify this I did the following as a test:

 1) ssh to the remote Solaris box from my desktop.
 2) Find the sshd on the remote host that is running my connection (yea, ps!).
 3) Run /usr/proc/bin/pfiles to find out how many open file descriptors an sshd has before any X forwarding has occurred (answer: 9).
 4) Open an X app (xclock).
 5) Rerun pfiles and note that the fd count has gone to 10.
 6) Open another X app.
 7) Rerun pfiles and note that the fd count has gone to 11.
 8) Close an X app.
 9) Rerun pfiles and note that the fd count has gone back to 10, etc.
10) Run gs and note that the fd count has gone back up by 1.
11) Do more random work in gs (editing, previewing, etc., which tend to open and close windows).
12) Occasionally run pfiles and note that the number of fds constantly grows and does not shrink.
13) Artificially inflate the number of fds up to 64 (the default max on Solaris) using gs and more xclocks.
14) When the fd count reaches 64, watch sshd go into a tight loop and start soaking up CPU when you try to open the 65th window.
14a) Note that no new X windows (whether a new process or a new window of an existing process) can be opened.
15) Verify with truss that sshd is running the same loop of system calls as shown above.

At this point we were pretty confident that we had a bug in both gs and sshd. Thinking the sshd bug the more important, I went to the source. I found this snippet in sshd.c (RCSID("$OpenBSD: sshd.c,v 1.204 2001/08/23 17:59:31 camield Exp $");), starting at line 994:

    newsock = accept(listen_socks[i], (struct sockaddr *)&from, &fromlen);
    if (newsock < 0) {
            if (errno != EINTR && errno != EWOULDBLOCK)
                    error("accept: %.100s", strerror(errno));
            continue;
    }
    if (fcntl(newsock, F_SETFL, 0) < 0) {
            error("newsock del O_NONBLOCK: %s", strerror(errno));
            continue;
    }

It would seem the result from accept() is checked but not properly responded to when it is -1. All the check decides is whether to log something or not; the continue is executed either way, so the server loop goes straight back to poll()/accept() and fails again without ever backing off, which matches the tight loop seen in the truss output above.
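Below is a minimal sketch of one way such an accept loop could avoid spinning when the process hits its descriptor limit: treat EMFILE/ENFILE as a distinct case and pause briefly before retrying, since the pending connection stays queued and poll() would otherwise wake the daemon again immediately. This is illustrative only, not the actual OpenSSH code or any official fix; the helper name accept_with_emfile_backoff, the 100 ms back-off, and the use of fprintf in place of sshd's error() are assumptions made for a self-contained example.

    #include <errno.h>
    #include <poll.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>

    /*
     * Illustrative helper (not from sshd.c): accept one connection on
     * listen_fd, backing off briefly when the process is out of file
     * descriptors instead of letting the caller spin on poll()/accept().
     */
    static int
    accept_with_emfile_backoff(int listen_fd)
    {
            struct sockaddr_storage from;
            socklen_t fromlen = sizeof(from);
            int newsock;

            newsock = accept(listen_fd, (struct sockaddr *)&from, &fromlen);
            if (newsock >= 0)
                    return newsock;

            if (errno == EMFILE || errno == ENFILE) {
                    /*
                     * Out of descriptors: the pending connection cannot be
                     * accepted, so the listening socket stays readable and
                     * the main poll() loop would otherwise return at once.
                     * Sleep roughly 100 ms so the daemon stops burning CPU
                     * until some descriptors are released.
                     */
                    fprintf(stderr, "accept: %s; backing off\n", strerror(errno));
                    poll(NULL, 0, 100);
            } else if (errno != EINTR && errno != EWOULDBLOCK) {
                    fprintf(stderr, "accept: %s\n", strerror(errno));
            }
            return -1;
    }

In the loop quoted above, something like this would take the place of the bare accept() call. It does not free the leaked descriptors; it only keeps sshd from consuming the CPU while the limit is exhausted.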
I may be totally off base on this being the right part of the code responsible for the behavior I am seeing, but I thought it might be a good place to start. I did verify that this piece of code is still the same in openssh-3.0.2p1 (RCSID("$OpenBSD: sshd.c,v 1.209 2001/11/10 13:19:45 markus Exp $");).

Thanks for your attention,

Tony Doan
tdoan@tdoan.com
Could you please provide ssh -v -v -v output?
no additional info provided.
Mass change of RESOLVED bugs to CLOSED