Bug 1102 - C program 'write' with zero length hangs
Summary: C program 'write' with zero length hangs
Status: CLOSED FIXED
Alias: None
Product: Portable OpenSSH
Classification: Unclassified
Component: ssh
Version: 4.1p1
Hardware: PPC AIX
Importance: P2 normal
Assignee: Assigned to nobody
URL:
Keywords:
Depends on:
Blocks: V_4_4
 
Reported: 2005-10-12 22:22 AEST by Tim Chamberlain
Modified: 2006-11-17 09:21 AEDT

See Also:


Attachments
sample program which fails (56 bytes, application/octet-stream)
2005-10-12 22:25 AEST, Tim Chamberlain
no flags Details
sample program which fails (56 bytes, application/octet-stream)
2005-10-12 22:27 AEST, Tim Chamberlain
no flags Details
text version of previous attachment (56 bytes, patch)
2005-10-12 22:34 AEST, Damien Miller
no flags Details | Diff
Additional comments (56 bytes, text/plain)
2005-10-12 23:07 AEST, Tim Chamberlain
no flags Details
Debug log files ZIPped up (7.62 KB, application/octet-stream)
2005-10-13 18:49 AEST, Tim Chamberlain
no flags Details
Log file extract (1.04 KB, text/plain)
2005-10-13 21:03 AEST, Tim Chamberlain
no flags Details
only close connection for zero-length stdin reads when errno set (2.42 KB, patch)
2005-10-13 21:31 AEST, Darren Tucker
no flags Details | Diff
Handle zero-length reads on AIX ptys (1.92 KB, patch)
2005-10-17 23:47 AEST, Darren Tucker
no flags Details | Diff
Handle zero-length reads on AIX only (2.62 KB, patch)
2006-06-23 20:55 AEST, Darren Tucker
djm: ok+
Details | Diff
Fixes for bug #1102 as applied to the tree. (2.19 KB, patch)
2006-11-17 09:19 AEDT, Darren Tucker
no flags Details | Diff

Description Tim Chamberlain 2005-10-12 22:22:35 AEST
A simple 'write()' system call from a 'C' program with zero as the data length hangs forever. It is easily reproducible with a "one-line" program containing the statement

write(1,"anything",0);

It appears to be an AIX-only problem: both 5.2 and 5.3 fail, but it works fine on SCO Unix. The attached nullout.c will save a little bit of typing.

Our client has since tried it with "OpenSSH_4.2p1 and OpenSSL 0.9.8 5-Jul-05" build and reports it still fails.
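
For readers without the attachment, the one-line statement above can be wrapped as follows. This is a minimal sketch and not the attached nullout.c; the helper name zero_length_write is hypothetical.

```c
#include <unistd.h>

/*
 * Hypothetical helper (not the attached nullout.c): issue the zero-length
 * write from the report on the given descriptor and return write()'s result.
 */
ssize_t zero_length_write(int fd)
{
    /* The buffer contents are irrelevant because the length is 0. */
    return write(fd, "anything", 0);
}
```

Calling zero_length_write(1) from main() in a program started inside an ssh pty session on an affected AIX host should correspond to the case reported here; on most other systems the call is expected to return 0 with no visible effect.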
Comment 1 Tim Chamberlain 2005-10-12 22:25:24 AEST
Created attachment 987 [details]
sample program which fails
Comment 2 Tim Chamberlain 2005-10-12 22:27:11 AEST
Created attachment 988 [details]
sample program which fails
Comment 3 Damien Miller 2005-10-12 22:34:42 AEST
Created attachment 990 [details]
text version of previous attachment

this looks like a kernel bug on your OS - I can't see what it has to do with OpenSSH.
Comment 4 Tim Chamberlain 2005-10-12 23:07:27 AEST
Created attachment 991 [details]
Additional comments

That was my initial thought.

However the sample program works (without recompilation/relink) when logged in locally and via 'rlogin' or 'telnet' -- to me this would point to the ssh i/o interface and/or driver on the host system.

Additionally we have tried both the ETerm and PuTTY clients and they fail identically on AIX only.
Comment 5 Darren Tucker 2005-10-12 23:10:43 AEST
Which AIX Maintenance Levels do your systems have?  Does the problem occur with other pty-using programs such as telnetd?

(In reply to comment #3)
> this looks like a kernel bug on your OS - I can't see what it has to do with
> OpenSSH.

I agree.  Now some history: way back when dinosaurs roamed the earth (around AIX 4.3.3 ML 3 or so) the pty layer on AIX started returning zero for read() syscalls after zero-length writes to the pty.

This was a problem for sshd, since POSIX says that a return code of zero from read() means EOF; this effectively meant that a program performing zero-length writes such as yours would result in sshd closing the session.  Since this remained busted for quite a while, sshd was changed to ignore such zero-length reads to work around it (see bug #124 for the gory details).

I'm wondering if maybe IBM has attempted to fix this and gone to the other extreme?  AFAICT the zero-length write should be a no-op...  It's also possible that the work-around now has a side-effect.
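
The "no-op" expectation can be probed with a small test. The sketch below uses a pipe rather than a pty, which is an assumption made to keep it short, and the function name probe_zero_write is hypothetical. Where a zero-length write really is a no-op, the reader should see "no data available" (-1 with EAGAIN on a non-blocking descriptor) rather than a zero-length read that looks like EOF.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical probe: do a zero-length write into a pipe, then try a
 * non-blocking read. Returns read()'s result and stores its errno in *err.
 * Returns -2 if the pipe could not be set up or the write itself failed.
 */
ssize_t probe_zero_write(int *err)
{
    int p[2];
    char buf[16];
    ssize_t n;

    if (pipe(p) == -1)
        return -2;
    if (fcntl(p[0], F_SETFL, O_NONBLOCK) == -1 ||
        write(p[1], "", 0) == -1) {      /* the zero-length write */
        close(p[0]);
        close(p[1]);
        return -2;
    }
    errno = 0;
    n = read(p[0], buf, sizeof(buf));
    *err = errno;                        /* save before close() can change it */
    close(p[0]);
    close(p[1]);
    return n;
}
```

A result of 0 from the read would be the behaviour described above for the AIX pty layer, which a reader cannot tell apart from a real EOF.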
Comment 6 Darren Tucker 2005-10-12 23:56:34 AEST
Could you please attach (as an attachment, not in the comment field) the output from the server debugging when you run your program?  (ie run "/path/to/sshd -ddde -p 2022", then connect to the server on port 2022 and run your program.)

BTW, I had a look for the changes mentioned in bug #124 but didn't find the zero-length fix where I expected.   I'll need to look closer at that when I get a chance.
Comment 7 Tim Chamberlain 2005-10-13 18:49:49 AEST
Created attachment 992 [details]
Debug log files ZIPped up

These are the debug files you asked for. I did it with both E-Term32 and PuTTY. I have included the debug output from the AIX server (aix-*.log files) and from the PC client (pc-*.log files).
Comment 8 Darren Tucker 2005-10-13 20:03:08 AEST
Did these lines appear in the sshd debug output immediately after you ran your program?
debug2: channel 0: rcvd adjust 2
debug2: channel 0: read<=0 rfd 10 len 0
debug2: channel 0: read failed

BTW, you didn't mention which AIX Maintenance Level and/or PTF you have on your systems.
Comment 9 Tim Chamberlain 2005-10-13 21:03:08 AEST
Created attachment 993 [details]
Log file extract

The simple answer to your question is YES, but for completeness I have extracted the part of the logfile that covers the duration of the test program.

I'm still waiting for the maint/patch level info from our client.
Comment 10 Darren Tucker 2005-10-13 21:31:16 AEST
Created attachment 994 [details]
only close connection for zero-length stdin reads when errno set

I don't think that your program is really hanging, although it looks that way.  What I think is happening is that your zero-length write results in a zero-length read in sshd, which results in the channel being shut down.

sshd is waiting for all of the file descriptors to close, while your program (or the shell) is waiting for its stdin to be read.  With them deadlocked, it would appear that the sshd session hung.

I just read the SUSv3 specs for read(2) (http://www.opengroup.org/onlinepubs/000095399/functions/read.html).  It's not clear but it appears that returning a zero-length read is permitted for STREAMS sockets (although I didn't think AIX's pty layer was STREAMS based).  So, AIX's behaviour might be compliant, although quite unusual.

Anyway, please try the attached patch (against -current but should apply to 4.1p1 or 4.2p1).  It's a bit ugly but it seems to be the only way to handle the zero-length case, assuming the above is correct.
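
The rule named in the patch title can be sketched roughly as below. This is an illustration rather than the patch itself: the condition matches the one quoted later in this bug, but the helper names read_means_close and read_checked are hypothetical. The sketch clears errno before the read so that a stale value from an earlier call cannot be mistaken for an error on this one.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Sketch of the rule in the patch title: a zero-length read only counts as
 * "close" when errno is set. Hypothetical helper, not sshd code.
 */
int read_means_close(ssize_t len, int err)
{
    return len < 0 || (len == 0 && err != 0);
}

/*
 * Hypothetical wrapper: clear errno, read, then classify the result.
 * Sets *closed to 1 if the descriptor should be treated as closed.
 */
ssize_t read_checked(int fd, void *buf, size_t n, int *closed)
{
    ssize_t len;

    errno = 0;
    len = read(fd, buf, n);
    *closed = read_means_close(len, errno);
    return len;
}
```

With this rule a zero-length read that leaves errno at 0, which is what the zero-length write appears to produce on the AIX pty, is ignored instead of shutting the channel down.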
Comment 11 Darren Tucker 2005-10-17 23:47:18 AEST
Created attachment 1002 [details]
Handle zero-length reads on AIX ptys

I don't think the change to the control socket code is necessary so I've removed it.  Hopefully this will still resolve the problem.
Comment 12 Darren Tucker 2006-06-23 20:55:22 AEST
Created attachment 1147 [details]
Handle zero-length reads on AIX only

I was wondering if there are any platforms out there that don't set errno... so this ought to be safer (although admittedly uglier).

Unless there are objections I'd like to commit this one.
Comment 13 Damien Miller 2006-06-23 21:07:52 AEST
Comment on attachment 1147 [details]
Handle zero-length reads on AIX only

looks ok to me
Comment 14 Darren Tucker 2006-06-23 21:24:49 AEST
Thanks all, this patch has been applied and will be in v4.4.
Comment 15 Darren Tucker 2006-09-28 19:25:34 AEST
With the release of 4.4, we believe that this bug is now closed.  For information about the release please see http://www.openssh.com/txt/release-4.4 .
Comment 16 Varun Sethi 2006-11-17 07:44:11 AEDT
(In reply to comment #12)
> Created an attachment (id=1147) [details]
> Handle zero-length reads on AIX only
> I was wondering if there's any platforms out there that don't set
> errno... so this ought to be safer (although admittedly uglier).
> Unless there are objections I'd like to commit this one

Shouldn't the fix in channels.c 
+#ifndef PTY_ZEROREAD
 		if (len <= 0) {
+#else
+		if (len < 0 || (len == 0 && errno != 0)) {
+#endif

actually be
+#ifdef PTY_ZEROREAD
 		if (len <= 0) {
+#else
+		if (len < 0 || (len == 0 && errno != 0)) {
+#endif

After applying the modified fix (changing ifndef to ifdef in channels.c) on AIX, the ssh session hang is resolved. But now I face another problem on AIX.
Task: Log in with ssh using the -X or -Y option, start and exit wish, then try to log out.
Result: The DISPLAY variable is correctly set. After exiting wish and trying to log out of the ssh shell, the shell displays "logout" and then hangs there. The hanging ssh shell must be ended with Ctrl-C.
Steps to reproduce:
client  prompt$ ssh -X server
server  prompt$ wish
     wish prompt: exit
server prompt $ exit
   logout
Comment 17 Varun Sethi 2006-11-17 07:46:25 AEDT
(In reply to comment #16)

Forgot to mention that I applied the patch to openssh-4.1.
Comment 18 Darren Tucker 2006-11-17 09:19:41 AEDT
Created attachment 1208 [details]
Fixes for bug #1102 as applied to the tree.

> Shouldn't the fix in channels.c 
> +#ifndef PTY_ZEROREAD

It's the other way around, but yes the patch here is wrong.  It got fixed in the tree, and the fixes are in the 4.4p1 and 4.5p1 releases.

The changes were:

 - (dtucker) [channels.c configure.ac serverloop.c] Bug #1102: Around AIX
   4.3.3 ML3 or so, the AIX pty layer started passing zero-length writes
   on the pty slave as zero-length reads on the pty master, which sshd
   interprets as the descriptor closing.  Since most things don't do zero
   length writes this rarely matters, but occasionally it happens, and when
   it does the SSH pty session appears to hang, so we add a special case for
   this condition.  ok djm@
 - (dtucker) [serverloop.c] Get ifdef/ifndef the right way around for the bug
   #1102 workaround.
 - (dtucker) [channels.c serverloop.c] Apply the bug #1102 workaround to ptys
   only, otherwise sshd can hang exiting non-interactive sessions.

These are included in the attached patch.

As for the thing with wish, if you are having a different problem with a current release then please open a new bug for it (mention this bug# if you want).  I am going to re-close this bug.