Bug 2565 - High baud rate gets sent, solaris closes pty
Summary: High baud rate gets sent, solaris closes pty
Status: CLOSED MOVED
Alias: None
Product: Portable OpenSSH
Classification: Unclassified
Component: sshd (show other bugs)
Version: 7.1p2
Hardware: SPARC Solaris
: P5 minor
Assignee: Tomas Kuthan
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-04-18 12:49 AEST by Tim Hogard
Modified: 2021-04-23 15:00 AEST (History)
3 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Tim Hogard 2016-04-18 12:49:12 AEST
It seems that trying to set the tty baud rate in Solaris to reasonable modern values is broken resulting in the pseudo tty output being closed.  This effects Solaris 11.3 back to at least Sol 9.

I can replicate it from OS X:
stty ospeed 57600
ssh solaris
Password: ....
(no output)

I can force it 9600 to fix the problem with code that verifies the problem and fixes it:
 case TTY_OP_ISPEED_PROTO2 (and TTY_OP_OSPEED_PROTO2)
...
 baud = packet_get_int();
+debug("ibaud=%d",baud);
+baud=9600;

On some clients (such as QNX) stty can be used as a work around.  "stty ispeed 9600;stty ospeed 9600;ssh broken-host" will work.

Is it appropriate to try to fix this in sshd?  If so, then should the proper way to fix this be something along the line of if ( baud > MAX_BAUD) baud=MAX_BAUD;?
I can't see a way of setting MAX_BAUD other than an config option hack with a list for broken systems.  Even very old Solaris has B57600 in termios.h so that can't be used as an indicator.
Comment 1 Damien Miller 2016-04-18 20:59:31 AEST
does setting ospeed actually do anything on a PTY? AFAIK it doesn't on OS X, BSD or Linux.

If the terminal is being closed as a result of a failed cfsetispeed() call then that indicates a kernel or (possibly) libc problem - ssh doesn't exit when that fails, it just logs an error and continues.
Comment 2 Darren Tucker 2016-07-20 13:48:15 AEST
(In reply to Damien Miller from comment #1)
> does setting ospeed actually do anything on a PTY? AFAIK it doesn't
> on OS X, BSD or Linux.
> 
> If the terminal is being closed as a result of a failed
> cfsetispeed() call then that indicates a kernel or (possibly) libc
> problem - ssh doesn't exit when that fails, it just logs an error
> and continues.

I can reproduce this with an OpenBSD client and Solaris 11 server.  The cfsetispeed succeeds but it looks like the data from the pty slave never makes it back to the master.

truss'ing (with -v pollsys) sshd then hitting enter ends with (lwp_sigmask calls elided):
21585:  write(2, " [ d t u c k e r @ s o l".., 19)      = 19
21583:   pollsys(0x08047890, 2, 0x00000000, 0x00000000) (sleeping...)
21583:           fd=4  ev=POLLRDNORM rev=0
21583:           fd=6  ev=POLLRDNORM rev=0x814 
21583:  pollsys(0x08047000, 2, 0x00000000, 0x00000000) (sleeping...)
21585:  read(0, 0x080473AC, 1)          (sleeping...)

The pids are:
 dtucker 21585 21583   0 11:40:29 pts/2       0:00 -bash
    root 21580 21579   0 11:40:27 pts/1       0:00 sshd -d -p 2022 -R
 dtucker 21583 21580   0 11:40:29 pts/1       0:00 sshd -d -p 2022 -R

so we've got the shell writing the prompt to stderr then waiting for input on stdin, which are the pty.  Looks reasonable.

The sshd post-auth privsep slave is blocked in poll (which is how select is implemented on Solaris).

sshd's descriptors are:

$ sudo lsof -p 21583 
COMMAND  PID USER   FD   TYPE             DEVICE SIZE/OFF      NODE NAME
[...]
sshd   21583 root    0u  VCHR              221,1          443220255 /dev/pts/1
sshd   21583 root    1u  VCHR              221,1          443220255 /dev/pts/1
sshd   21583 root    2u  VCHR              221,1          443220255 /dev/pts/1
sshd   21583 root    3r  DOOR                         0t0        46 /system/volatile/name_service_door (door to nscd[426]) (FA:->0xffffff00cf178700)
sshd   21583 root    4u  FIFO 0xffffff00dadd16c0      0t0    184434 (fifofs) PIPE->0xffffff00dadd1750
sshd   21583 root    5u  FIFO 0xffffff00dadd1750      0t0    184434 (fifofs) PIPE->0xffffff00dadd16c0
sshd   21583 root    6u  IPv4 0xffffff00ce59a100   0t5578       TCP sol11.dtucker.net:2022->client.dtucker.net:31563 (ESTABLISHED)
sshd   21583 root    7u  VCHR              238,2           81265120 /devices/pseudo/clone@0:ptm->ptm
sshd   21583 root   10u  VCHR              238,2           81265120 /devices/pseudo/clone@0:ptm->ptm

So, sshd is waiting to read from the TCP connection or a pipe, but not the pty master.  That seems odd.

For the record, sshd is blocking is in wait_until_can_do_something():

$ sudo gdb -q --args `pwd`/sshd -de -p 2022 -o useprivilegeseparation=no -r
(gdb) set follow-fork child
(gdb) run
[...]
debug1: Setting controlling tty using TIOCSCTTY.
^C
Program received signal SIGINT, Interrupt.
0xfec78ec5 in __pollsys () from /lib/libc.so.1
(gdb) bt
#0  0xfec78ec5 in __pollsys () from /lib/libc.so.1
#1  0xfec664c6 in _pollsys () from /lib/libc.so.1
#2  0xfec233b8 in pselect () from /lib/libc.so.1
#3  0xfec236b9 in select () from /lib/libc.so.1
#4  0x08074bbb in wait_until_can_do_something (readsetp=0x80479f8, 
    writesetp=0x80479f4, maxfdp=0x80479f0, nallocp=0x80479ec, max_time_ms=0)
    at ../../serverloop.c:370
#5  0x08075ce7 in server_loop2 (authctxt=0x814be78) at ../../serverloop.c:863
#6  0x08081a3b in do_authenticated2 (authctxt=0x814be78)
    at ../../session.c:2758
#7  0x0807be6c in do_authenticated (authctxt=0x814be78) at ../../session.c:271
#8  0x0806be26 in main (ac=7, av=0x81451a8) at ../../sshd.c:2324
Comment 3 Darren Tucker 2016-07-20 13:56:35 AEST
Comparing the truss from a client with ospeed=9600:

5172:   pollsys(0x08047890, 3, 0x00000000, 0x00000000) (sleeping...)
5172:           fd=4  ev=POLLRDNORM rev=0
5172:           fd=6  ev=POLLRDNORM rev=0x814  
5172:           fd=9  ev=POLLRDNORM rev=0x804  
5174:   read(0, 0x080473AC, 1)          (sleeping...)

$ sudo lsof -p 5172 
COMMAND  PID USER   FD   TYPE             DEVICE SIZE/OFF      NODE NAME
[...]
sshd    5172 root    4u  FIFO 0xffffff00dadd16c0      0t0    184472 (fifofs) PIPE->0xffffff00dadd1750
sshd    5172 root    5u  FIFO 0xffffff00dadd1750      0t0    184472 (fifofs) PIPE->0xffffff00dadd16c0
sshd    5172 root    6u  IPv4 0xffffff00de346380   0t6014       TCP sol11.dtucker.net:2022->quoll.dtucker.net:30321 (ESTABLISHED)
sshd    5172 root    7u  VCHR              238,2           81265120 /devices/pseudo/clone@0:ptm->ptm
sshd    5172 root    9u  VCHR              238,2           81265120 /devices/pseudo/clone@0:ptm->ptm
sshd    5172 root   10u  VCHR              238,2           81265120 /devices/pseudo/clone@0:ptm->ptm

so sshd is polling an extra descriptor attached to the pty master.  I don't know why yet though.
Comment 4 Darren Tucker 2016-07-20 14:41:05 AEST
Confirmed that commenting out only the calls to cfsetospeed and cfsetispeed in ttymodes.c prevents the problem.

Going back through a complete strace of sshd I see this:

6212:   pollsys(0x08047890, 3, 0x00000000, 0x00000000)  = 2
6212:           fd=4  ev=POLLRDNORM rev=0
6212:           fd=6  ev=POLLOUT|POLLRDNORM rev=POLLOUT
6212:           fd=9  ev=POLLRDNORM rev=POLLRDNORM
6212:   clock_gettime(4, 0x08047954)                    = 0
6212:   read(9, 0x0804391C, 16384)                      = 0
6212:   close(9)                                        = 0
6212:   write(6, "\0\0\010 BA0 q zF8 {88F9".., 84)      = 84
6212:   getpid()                                        = 6212 [6209]
6212:   clock_gettime(4, 0x080478B4)                    = 0
6212:   pollsys(0x08047890, 2, 0x00000000, 0x00000000)  = 1
6212:           fd=4  ev=POLLRDNORM rev=0
6212:           fd=6  ev=POLLOUT|POLLRDNORM rev=POLLOUT

where fd#9 is our missing FD on the pty master, and this is the first pollsys where fd#9 is included.

From this we can infer from this that sshd does in fact set up the descriptors correctly.  pollsys says that fd#9 is readable, and when sshd reads it the read returns zero, indicating end-of-file.  sshd then closes the descriptor and removes it from the set it's looking at, hence why we didn't see it in the later calls.

If this sounds familiar it's because we've encountered the same thing on AIX (but probably for different reasons, in that case it was passing through zero-byte writes from the pty slave).  There's already a workaround hack for it (PTY_ZEROREAD) which seems to help in this case, but I'd like to understand what the problem is (and what other potential side effects are) before we go enabling that on Solaris.

Anyone from Oracle care to comment?  Why does setting the baud rate on a pty to 57600 cause it to change its read behaviour?
Comment 5 Tomas Kuthan 2016-08-03 00:46:55 AEST
Thanks for reporting this issue.

It turns out, it is a long-standing Solaris kernel bug. It only pertains to the one particular speed (57600 baud), other rates (such as 115200, 230400 and 460800) work fine, provided that the HW device supports it.

I have a verified fix ready for this. We will fix it eventually in Solaris.

I am taking ownership of the bug for now, but we could just as well close it, as this is not a bug in OpenSSH.
Comment 6 Darren Tucker 2016-08-03 09:30:51 AEST
(In reply to Tomas Kuthan from comment #5)
> Thanks for reporting this issue.
> 
> It turns out, it is a long-standing Solaris kernel bug. It only
> pertains to the one particular speed (57600 baud), other rates (such
> as 115200, 230400 and 460800) work fine, provided that the HW device
> supports it.

That sounds fascinatingly specific.  Are you able to share the details?

> I have a verified fix ready for this. We will fix it eventually in
> Solaris.

Thanks!

> I am taking ownership of the bug for now, but we could just as well
> close it, as this is not a bug in OpenSSH.

Given that there's already a workaround (ie "don't do that then") and a proper fix coming I think we should close the bug.
Comment 7 Tomas Kuthan 2016-08-03 18:56:19 AEST
On Solaris, 16 original baud rates B0 through B38400 were represented in the four least significant bits of termios.c_cflag. When additional baud rates were added back in 1994, the adjacent bits were already assigned for a different purpose, so bits 22 and 23 were used to flag extended baudrates. Baud rates B57600 through B460800 were again represented in the least significant bits with the extension bit set. All the speed handling code was updated to take the extension bits into account.

... all the code, but this one if, that was left behind. B57600 is encoded with the extension bit set and with the 4 least significant bits all zeros. The neglected if has mistaken B57600 for B0. B0 means 'hang up'; hung up it did.

I am closing the bug as not-an-OpenSSH-bug. (Feel free to change the substatus.)
Comment 8 Damien Miller 2021-04-23 15:00:57 AEST
closing resolved bugs as of 8.6p1 release