Summary:     sshd slow connect with 'UseDNS yes'
Product:     Portable OpenSSH
Component:   sshd
Status:      CLOSED WORKSFORME
Severity:    major
Priority:    P2
Version:     5.2p1
Hardware:    All
OS:          Linux
Reporter:    Brian <brian.p.stamper>
Assignee:    Assigned to nobody <unassigned-bugs>
CC:          brian.p.stamper, dtucker
Attachments: output of sshd -ddde; pcap of one connection; strace /usr/sbin/sshd -D; strace /usr/sbin/sshd -ddde
Description

Brian 2009-10-27 10:30:43 AEDT

Could you please run the server in debug mode (e.g. "/path/to/sshd -ddde -p 2222", then point your client at port 2222) and add the output as an attachment? Please indicate where in the output the delay(s) occur.

Created attachment 1711 [details]
output of sshd -ddde

Attached as requested. It's not one spot in the log where it pauses. I inserted numbers roughly once a second at each point where the debug output hung, so there are roughly 5 spots where it hangs for 3-4 seconds.
It's probably also worth noting that there are delays cleaning up the session on the server side too. This is what happens when I disconnect:
debug1: Received SIGCHLD.
debug1: session_by_pid: pid 16135
debug1: session_exit_message: session 0 channel 0 pid 16135
debug2: channel 0: request exit-status confirm 0
debug1: session_exit_message: release channel 0
debug2: channel 0: write failed
debug2: channel 0: close_write
debug2: channel 0: send eow
debug2: channel 0: output open -> closed
debug3: mm_request_send entering: type 28
debug3: monitor_read: checking request 28
debug3: mm_answer_pty_cleanup entering
debug1: session_by_tty: session 0 tty /dev/pts/2
debug3: mm_session_close: session 0 pid 16134
debug3: mm_session_close: tty /dev/pts/2 ptyfd 5
debug1: session_pty_cleanup: session 0 release /dev/pts/2
debug3: session_unused: session id 0 unused
debug3: mm_request_receive entering
debug2: channel 0: read<=0 rfd 10 len -1
debug2: channel 0: read failed
debug2: channel 0: close_read
debug2: channel 0: input open -> drain
debug2: channel 0: ibuf empty
debug2: channel 0: send eof
debug2: channel 0: input drain -> closed
debug2: channel 0: send close
debug2: notify_done: reading
debug3: channel 0: will not send data after close
debug2: channel 0: rcvd close
debug3: channel 0: will not send data after close
debug2: channel 0: is dead
debug2: channel 0: gc: notify user
debug1: session_by_channel: session 0 channel 0
debug1: session_close_by_channel: channel 0 child 0
debug1: session_close: session 0 pid 0
debug3: session_unused: session id 0 unused
debug2: channel 0: gc: user detached
debug2: channel 0: is dead
debug2: channel 0: garbage collecting
debug1: channel 0: free: server-session, nchannels 1
debug3: channel 0: status: The following connections are open:
#0 server-session (t4 r0 i3/0 o3/0 fd -1/-1 cfd -1)
debug3: channel 0: close_fds r -1 w -1 e -1 c -1
Connection closed by 143.232.109.139
debug1: do_cleanup
debug3: PAM: sshpam_thread_cleanup entering
Transferred: sent 3320, received 1728 bytes
Closing connection to 143.232.109.139 port 45031
debug3: mm_request_send entering: type 59
debug3: monitor_read: checking request 59
debug3: mm_answer_term: tearing down sessions
debug1: PAM: cleanup
debug1: PAM: deleting credentials
1
2
3
4
debug1: PAM: closing session
1
2
3
4
A tcpdump shows that every delay in the login and disconnect process corresponds to a request to my DNS server. I'm unsure why sshd would need to do a lookup 8 distinct times per connection, plus 2 more on disconnect.
Here are the response times from my DNS server, forward and reverse, so it's not simply DNS performance or latency:
[root@corvus ~]# time nslookup flux.arc.nasa.gov
Server: 128.102.0.34
Address: 128.102.0.34#53
Name: flux.arc.nasa.gov
Address: 143.232.109.139
real 0m0.006s
user 0m0.001s
sys 0m0.003s
[root@corvus ~]# time nslookup 143.232.109.139
Server: 128.102.0.34
Address: 128.102.0.34#53
139.109.232.143.in-addr.arpa name = flux.arc.nasa.gov.
real 0m0.006s
user 0m0.001s
sys 0m0.004s
[root@corvus ~]#
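As an aside, the parallel A/AAAA behavior discussed later in this report can be observed from the resolver API itself: getaddrinfo() with AF_UNSPEC asks for both record types, while AF_INET asks only for A records. A minimal sketch (using Python's binding to the system resolver; "localhost" is just a placeholder hostname):

```python
import socket
import time

def timed_lookup(host, family):
    """Time one getaddrinfo() call for the given address family."""
    start = time.monotonic()
    try:
        results = socket.getaddrinfo(host, None, family, socket.SOCK_STREAM)
    except socket.gaierror:
        results = []
    return time.monotonic() - start, results

# AF_INET triggers only an A query; AF_UNSPEC asks for A and AAAA.
# Against a nameserver that silently drops AAAA queries, the second
# call is the one that stalls until the resolver times out.
v4_time, v4_results = timed_lookup("localhost", socket.AF_INET)
any_time, any_results = timed_lookup("localhost", socket.AF_UNSPEC)
print(f"AF_INET:   {v4_time:.3f}s, {len(v4_results)} result(s)")
print(f"AF_UNSPEC: {any_time:.3f}s, {len(any_results)} result(s)")
```

Comparing the two timings against the same server is a quick way to tell whether a delay is specific to the AAAA path.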
Comment on attachment 1711 [details]
output of sshd -ddde

> debug1: do_pam_account: called
> 1 2 3 4
> debug3: PAM: do_pam_account pam_acct_mgmt = 0 (Success)

OK, we'll use this example since it's probably the simplest. The code that does this is in auth-pam.c:do_pam_account():

	debug("%s: called", __func__);

	if (sshpam_account_status != -1)
		return (sshpam_account_status);

	sshpam_err = pam_acct_mgmt(sshpam_handle, 0);
	debug3("PAM: %s pam_acct_mgmt = %d (%s)", __func__,
	    sshpam_err, pam_strerror(sshpam_handle, sshpam_err));

where previously the hostname was set via PAM_RHOST:

	sshpam_err = pam_set_item(sshpam_handle, PAM_RHOST, pam_rhost);

So in this case the blocking is happening inside either the PAM library or a PAM module. You can confirm this by repeating the same test but with UsePAM=no. There is one other delay marked in the output, where the pty is allocated. I suspect you will still see the delay at the pty allocation, but overall it will be much faster (because sshd caches the result of the name lookup). I don't know why the lookups inside PAM take so long, though.

Can you capture the name lookups? Either strace/truss "/path/to/sshd -D" and pick the requests out of the output, or run "tcpdump -s 1500 port 53" while connecting. I suspect you'll find that it's either IPv6 AAAA lookups or their inverse.

Thanks for the response. I probably can't work on this today. I'll get you the output tomorrow.

Created attachment 1716 [details]
pcap of one connection
pcap of one connection; it does not include the disconnect.
Created attachment 1717 [details]
strace /usr/sbin/sshd -D
This didn't seem to show a lot, though you may be better able to use the info than I.
Created attachment 1718 [details]
strace /usr/sbin/sshd -ddde
This seemed to have more info.
(In reply to comment #5)
> Created an attachment (id=1716) [details]
> pcap of one connection
>
> pcap of one connection. did not include disconnect.

I haven't gone through the other output yet, but this seems suspicious. Going from the start:

03:37:12.425222 corvus.arc.nasa.gov.45215 > ns2.arc.nasa.gov.domain: 20988+ PTR? 139.109.232.143.in-addr.arpa. (46) (DF)
03:37:12.428942 ns2.arc.nasa.gov.domain > corvus.arc.nasa.gov.45215: 20988* 1/3/3 PTR flux.arc.nasa.gov. (179)

a reverse lookup of 143.232.109.139 (IPv4) with an immediate response from ns2. Note the transaction ID "20988".

03:37:12.429217 corvus.arc.nasa.gov.44863 > ns2.arc.nasa.gov.domain: 55778+ A? flux.arc.nasa.gov. (35) (DF)
03:37:12.433199 ns2.arc.nasa.gov.domain > corvus.arc.nasa.gov.44863: 55778* 1/3/3 A flux.arc.nasa.gov (153)

a forward lookup of flux.arc.nasa.gov (IPv4), also answered immediately by ns2.

03:37:15.155167 corvus.arc.nasa.gov.58329 > ns2.arc.nasa.gov.domain: 26977+ A? flux.arc.nasa.gov. (35) (DF)
03:37:15.155180 corvus.arc.nasa.gov.58329 > ns2.arc.nasa.gov.domain: 24837+ AAAA? flux.arc.nasa.gov. (35) (DF)

lookups of flux.arc.nasa.gov for IPv4 (A) and IPv6 (AAAA) in parallel. No response; times out at 5 sec.

03:37:20.154422 corvus.arc.nasa.gov.57585 > ns1.arc.nasa.gov.domain: 26977+ A? flux.arc.nasa.gov. (35) (DF)
03:37:20.154444 corvus.arc.nasa.gov.57585 > ns1.arc.nasa.gov.domain: 24837+ AAAA? flux.arc.nasa.gov. (35) (DF)
03:37:20.156796 ns1.arc.nasa.gov.domain > corvus.arc.nasa.gov.57585: 26977* 1/3/3 A flux.arc.nasa.gov (153)

retries the request on ns1, which responds only to the A (IPv4) request. This pattern is repeated later on ns2:

03:37:30.167249 corvus.arc.nasa.gov.57935 > ns2.arc.nasa.gov.domain: 39226+ A? flux.arc.nasa.gov. (35) (DF)
03:37:30.167283 corvus.arc.nasa.gov.57935 > ns2.arc.nasa.gov.domain: 59667+ AAAA? flux.arc.nasa.gov. (35) (DF)
03:37:30.170583 ns2.arc.nasa.gov.domain > corvus.arc.nasa.gov.57935: 39226* 1/3/3 A flux.arc.nasa.gov (153)
03:37:35.167403 corvus.arc.nasa.gov.57935 > ns2.arc.nasa.gov.domain: 39226+ A? flux.arc.nasa.gov. (35) (DF)

I suspect your nameservers are silently dropping AAAA lookups (this is common enough that there's an RFC about it, RFC 4074). Try these:

host -t A flux.arc.nasa.gov. ns1.arc.nasa.gov.
host -t AAAA flux.arc.nasa.gov. ns1.arc.nasa.gov.
host -t A flux.arc.nasa.gov. ns2.arc.nasa.gov.
host -t AAAA flux.arc.nasa.gov. ns2.arc.nasa.gov.

> Try these:
>
> host -t A flux.arc.nasa.gov. ns1.arc.nasa.gov.
> host -t AAAA flux.arc.nasa.gov. ns1.arc.nasa.gov.
> host -t A flux.arc.nasa.gov. ns2.arc.nasa.gov.
> host -t AAAA flux.arc.nasa.gov. ns2.arc.nasa.gov.
[root@corvus ~]# time host -t A flux.arc.nasa.gov. ns1.arc.nasa.gov.
Using domain server:
Name: ns1.arc.nasa.gov.
Address: 143.232.252.34#53
Aliases:
flux.arc.nasa.gov has address 143.232.109.139
real 0m5.026s
user 0m0.002s
sys 0m0.003s
[root@corvus ~]# time host -t AAAA flux.arc.nasa.gov. ns1.arc.nasa.gov.
Using domain server:
Name: ns1.arc.nasa.gov.
Address: 143.232.252.34#53
Aliases:
flux.arc.nasa.gov has no AAAA record
real 0m5.009s
user 0m0.001s
sys 0m0.005s
[root@corvus ~]# time host -t A flux.arc.nasa.gov. ns2.arc.nasa.gov.
Using domain server:
Name: ns2.arc.nasa.gov.
Address: 128.102.0.34#53
Aliases:
flux.arc.nasa.gov has address 143.232.109.139
real 0m5.008s
user 0m0.002s
sys 0m0.003s
[root@corvus ~]# time host -t AAAA flux.arc.nasa.gov. ns2.arc.nasa.gov.
Using domain server:
Name: ns2.arc.nasa.gov.
Address: 128.102.0.34#53
Aliases:
flux.arc.nasa.gov has no AAAA record
real 0m5.007s
user 0m0.003s
sys 0m0.001s
(In reply to comment #9)
> [root@corvus ~]# time host -t A flux.arc.nasa.gov. ns1.arc.nasa.gov.
> [...]
> flux.arc.nasa.gov has address 143.232.109.139
>
> real 0m5.026s

There's your problem: your DNS is taking 5s to respond. I'm surprised that it's happening for A records too, given what I saw in the packet trace, but I expected to see this for AAAA.

So your DNS (or resolver) is slow, and that's compounded by something in PAM making multiple trips to the well. The options I can think of are:

* fix your DNS or resolver
* make your resolver send only IPv4 requests
* live with UseDNS=no
* disable PAM

You also could try building OpenSSH with "configure --with-cflags=-DBROKEN_GETADDRINFO". This will force the use of the built-in getaddrinfo replacement, which happens to speak only IPv4, but it probably won't help the PAM bits. I suspect this will make minimal difference, though.
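Of those options, "live with UseDNS=no" is a one-line change in the server configuration. A sketch of the relevant sshd_config fragment (note that the PAM-side lookups happen separately, so this alone may not remove every delay):

	# /etc/ssh/sshd_config
	# Skip the per-connection reverse-then-forward DNS lookup
	# sshd does on the client address.
	UseDNS no

sshd needs a restart (or reload) for the change to take effect.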
> There's your problem: your DNS is taking 5s to respond.
Well, something odd is going on. I'm not convinced it's as simple as the DNS taking 5 seconds to respond, though I may be willing to concede the issue is not with OpenSSH. See the following:
[root@corvus ~]# time nslookup ns1.arc.nasa.gov 143.232.252.34
Server: 143.232.252.34
Address: 143.232.252.34#53
Name: ns1.arc.nasa.gov
Address: 143.232.252.34
real 0m0.005s
user 0m0.001s
sys 0m0.002s
[root@corvus ~]# time nslookup ns1.arc.nasa.gov ns1.arc.nasa.gov
Server: ns1.arc.nasa.gov
Address: 143.232.252.34#53
Name: ns1.arc.nasa.gov
Address: 143.232.252.34
real 0m5.008s
user 0m0.001s
sys 0m0.003s
[root@corvus ~]# time host -t A flux.arc.nasa.gov ns1.arc.nasa.gov
Using domain server:
Name: ns1.arc.nasa.gov
Address: 143.232.252.34#53
Aliases:
flux.arc.nasa.gov has address 143.232.109.139
real 0m5.008s
user 0m0.001s
sys 0m0.003s
[root@corvus ~]# time host -t A flux.arc.nasa.gov 143.232.252.34
Using domain server:
Name: 143.232.252.34
Address: 143.232.252.34#53
Aliases:
flux.arc.nasa.gov has address 143.232.109.139
real 0m0.005s
user 0m0.000s
sys 0m0.004s
[root@corvus ~]# time host -t AAAA flux.arc.nasa.gov ns1.arc.nasa.gov
Using domain server:
Name: ns1.arc.nasa.gov
Address: 143.232.252.34#53
Aliases:
flux.arc.nasa.gov has no AAAA record
real 0m5.008s
user 0m0.001s
sys 0m0.004s
[root@corvus ~]# time host -t AAAA flux.arc.nasa.gov 143.232.252.34
Using domain server:
Name: 143.232.252.34
Address: 143.232.252.34#53
Aliases:
flux.arc.nasa.gov has no AAAA record
real 0m0.005s
user 0m0.002s
sys 0m0.003s
[root@corvus ~]#
[root@corvus ~]# grep hosts /etc/nsswitch.conf
#hosts: db files nisplus nis dns
hosts: files dns
[root@corvus ~]# cat /etc/resolv.conf
search arc.nasa.gov
nameserver 128.102.0.34
nameserver 143.232.252.34
[root@corvus ~]#
Something with the resolver, maybe? Clearly, basic name resolution of the nameserver itself is plenty fast. But when the hostname of the DNS server is given as an argument, the lookup takes 5 seconds.
[root@corvus ~]# time nslookup ns1.arc.nasa.gov
Server: 128.102.0.34
Address: 128.102.0.34#53
Name: ns1.arc.nasa.gov
Address: 143.232.252.34
real 0m0.006s
user 0m0.001s
sys 0m0.003s
[root@corvus ~]# time nslookup ns2.arc.nasa.gov
Server: 128.102.0.34
Address: 128.102.0.34#53
Name: ns2.arc.nasa.gov
Address: 128.102.0.34
real 0m0.006s
user 0m0.001s
sys 0m0.003s
[root@corvus ~]# time host -t AAAA flux.arc.nasa.gov
flux.arc.nasa.gov has no AAAA record
real 0m0.006s
user 0m0.000s
sys 0m0.003s
[root@corvus ~]#
I'm not sure what changes when giving the hostname as an argument, as opposed to no argument or the IP.
"nslookup ns1.arc.nasa.gov"
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
09:48:23.191397 IP corvus.arc.nasa.gov.50677 > ns1.arc.nasa.gov.domain: 62346+ A? ns1.arc.nasa.gov. (34)
09:48:23.191831 IP ns1.arc.nasa.gov.domain > corvus.arc.nasa.gov.50677: 62346* 1/3/2 A ns1.arc.nasa.gov (132)
09:48:23.191921 IP corvus.arc.nasa.gov.40824 > ns1.arc.nasa.gov.domain: 61200+ PTR? 34.252.232.143.in-addr.arpa. (45)
09:48:23.192382 IP ns1.arc.nasa.gov.domain > corvus.arc.nasa.gov.40824: 61200* 1/3/3 (173)
"nslookup ns1.arc.nasa.gov ns1.arc.nasa.gov"
09:48:31.335862 IP corvus.arc.nasa.gov.53838 > ns1.arc.nasa.gov.domain: 16500+ A? ns1.arc.nasa.gov. (34)
09:48:31.335933 IP corvus.arc.nasa.gov.53838 > ns1.arc.nasa.gov.domain: 59525+ AAAA? ns1.arc.nasa.gov. (34)
09:48:31.336346 IP ns1.arc.nasa.gov.domain > corvus.arc.nasa.gov.53838: 16500* 1/3/2 A ns1.arc.nasa.gov (132)
09:48:36.336413 IP corvus.arc.nasa.gov.53838 > ns1.arc.nasa.gov.domain: 16500+ A? ns1.arc.nasa.gov. (34)
09:48:36.336812 IP ns1.arc.nasa.gov.domain > corvus.arc.nasa.gov.53838: 16500* 1/3/2 A ns1.arc.nasa.gov (132)
09:48:36.336920 IP corvus.arc.nasa.gov.53838 > ns1.arc.nasa.gov.domain: 59525+ AAAA? ns1.arc.nasa.gov. (34)
09:48:36.337318 IP ns1.arc.nasa.gov.domain > corvus.arc.nasa.gov.53838: 59525* 0/1/0 (81)
09:48:36.338002 IP corvus.arc.nasa.gov.43373 > ns1.arc.nasa.gov.domain: 56964+ A? ns1.arc.nasa.gov. (34)
09:48:36.338456 IP ns1.arc.nasa.gov.domain > corvus.arc.nasa.gov.43373: 56964* 1/3/2 A ns1.arc.nasa.gov (132)
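The trace above shows the A and AAAA queries being sent in parallel from the same source port, with only the A answer arriving before the 5-second retry, which matches glibc's getaddrinfo() behavior. As an aside (a suggestion, not something verified in this report): if the glibc on Fedora 11 is new enough (2.10 or later), the resolver can be told to send the two queries one at a time via resolv.conf:

	# /etc/resolv.conf
	search arc.nasa.gov
	nameserver 128.102.0.34
	nameserver 143.232.252.34
	# Serialize A/AAAA queries instead of sending them in parallel
	# ("single-request" is a glibc >= 2.10 resolver option)
	options single-request

This works around servers that mishandle the parallel query pair, at the cost of slightly slower lookups overall.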
Unfortunately, another group here is responsible for the DNS servers; I will engage them as well and see if they know what's causing the behavior. It looks as though something changed in the Fedora resolver, though: I don't get this behavior on Fedora 6 and 7 boxes using the same server from the same subnet.
[root@old-flux ~]# uname -r
2.6.22.1-32.fc6
[root@old-flux ~]# time nslookup ns1.arc.nasa.gov
Server: 128.102.0.34
Address: 128.102.0.34#53
Name: ns1.arc.nasa.gov
Address: 143.232.252.34
real 0m0.456s
user 0m0.004s
sys 0m0.002s
[root@old-flux ~]# time nslookup ns1.arc.nasa.gov ns1.arc.nasa.gov
Server: ns1.arc.nasa.gov
Address: 143.232.252.34#53
Name: ns1.arc.nasa.gov
Address: 143.232.252.34
real 0m0.009s
user 0m0.004s
sys 0m0.003s
[root@old-flux ~]#
[root@old-flux ~]# host -t A flux.arc.nasa.gov. ns1.arc.nasa.gov.
Using domain server:
Name: ns1.arc.nasa.gov.
Address: 143.232.252.34#53
Aliases:
flux.arc.nasa.gov has address 143.232.109.139
[root@old-flux ~]# host -t AAAA flux.arc.nasa.gov. ns1.arc.nasa.gov.
Using domain server:
Name: ns1.arc.nasa.gov.
Address: 143.232.252.34#53
Aliases:
flux.arc.nasa.gov has no AAAA record
[root@old-flux ~]# cat /etc/resolv.conf
search arc.nasa.gov
nameserver 128.102.0.34
nameserver 143.232.252.34
[root@old-flux ~]# grep hosts /etc/nsswitch.conf
#hosts: db files nisplus nis dns
hosts: files dns
[root@old-flux ~]#

More to suggest this is a change in the resolver on Fedora: using the name servers from the fine folks at Google and Red Hat, I seem to see the same behavior. So I'm not sure where I should be looking at this point. Do you know of a public DNS server that is definitely configured properly for IPv6? Since I don't see this behavior from hosts that are not Fedora 11 on the same subnets, I don't believe it's a router or DNS server issue. If you concur, I'll move this issue to the Fedora issues list.
-Brian

[root@centroid ~]# time nslookup ns2.google.com 216.239.34.10
Server: 216.239.34.10
Address: 216.239.34.10#53
Name: ns2.google.com
Address: 216.239.34.10
real 0m0.094s
user 0m0.001s
sys 0m0.003s
[root@centroid ~]# time nslookup ns2.google.com ns2.google.com
Server: ns2.google.com
Address: 216.239.34.10#53
Name: ns2.google.com
Address: 216.239.34.10
real 0m5.280s
user 0m0.002s
sys 0m0.002s
[root@centroid ~]# grep name /etc/resolv.conf
#nameserver 128.102.0.34
#nameserver 143.232.252.34
nameserver 216.239.34.10

========================================================

[root@centroid ~]# time nslookup ns1.redhat.com
Server: 66.187.233.210
Address: 66.187.233.210#53
Name: ns1.redhat.com
Address: 66.187.233.210
real 0m0.091s
user 0m0.002s
sys 0m0.002s
[root@centroid ~]# time nslookup ns1.redhat.com ns1.redhat.com
Server: ns1.redhat.com
Address: 66.187.233.210#53
Name: ns1.redhat.com
Address: 66.187.233.210
real 0m5.267s
user 0m0.001s
sys 0m0.004s
[root@centroid ~]# time nslookup ns1.redhat.com 66.187.233.210
Server: 66.187.233.210
Address: 66.187.233.210#53
Name: ns1.redhat.com
Address: 66.187.233.210
real 0m0.090s
user 0m0.001s
sys 0m0.002s
[root@centroid ~]# cat /etc/resolv.conf
search arc.nasa.gov
#nameserver 128.102.0.34
#nameserver 143.232.252.34
nameserver 66.187.233.210

Well, I've identified a workaround; it's not really a fix, just disabling IPv6. I'm not convinced my nameservers are the problem; rather, I think there's something broken with Fedora 11's use of IPv6, or possibly the Linux kernel post 2.6.27.

Added to /etc/modprobe.d/aliases.conf (a new file I created):

install net-pf-10 /bin/true
install ipv6 /bin/true

Added to /etc/modprobe.d/blacklist.conf:

# ipv6 die die die
blacklist ipv6

The symptoms have completely cleared up, and I don't use IPv6, so I'm happy. Thank you for your assistance and patience in troubleshooting what turned out to be not your problem.
-Brian

(In reply to comment #14)
> Well, I've identified a workaround, not really a fix. And it's really
> just disabling ipv6. I'm not convinced my nameservers are the problem,
> rather I think there's something broken with Fedora 11's use of ipv6,
> or possibly the linux kernel post 2.6.27.

I mentioned earlier that it might be in the resolver, and based on your results from disabling IPv6, I suspect it's the resolver that has some odd interaction with it.

> Thank you for your assistance and patience in troubleshooting what
> turned out to be not your problem.

You're welcome.

Mass move of bugs RESOLVED->CLOSED following the release of openssh-5.5p1