When the OpenSSH SFTP server running on Compaq/HP Tru64 UNIX 5.1 is sent an "ls" command for a directory on an NFS-mounted volume, it either returns an endlessly-repeating directory listing, or seems to hang without returning anything. The problem has to do with the semantics of the readdir call, as used in the process_readdir routine in sftp-server.c. readdir returns NULL to indicate the end of the directory listing. However, process_readdir may then later call readdir again on the same directory handle. For a local volume, it will continue returning NULL. However, for an NFS-mounted volume, after returning NULL once, on Tru64 it will start returning the entire directory listing again. Solution: add a separate Boolean flag to the Handle structure. This flag is initialized to false when a new directory context is created (it is not needed for file contexts). It is set to true when a readdir call returns NULL. Once it is true, no subsequent calls to readdir are to be made on that context; instead the next call to process_readdir will immediately return a SSH2_FX_EOF status.
How is this a bug in sftp and not a bug in Tru64?
What client? As far as I can see your solution would not work. Since we open a FD handle on the server, process on that handle, then close/remove that handle. The only non-freed case is if the handle passed to it is not a HANDLE_DIR or HANDLE_FILE or the handle is invalid. Which then returns an errno of ENOENT. So unless your hitting some odd memory corruption. I'm not seeing how you are getting this behavior unless the client your using is not RFC legal.
Err, if this bug is innapropriate, blame me. I suggested that if you squinted and read the posix spec [1] at an angle in bad light the behaviour of Tru64's readdir might not be non-compliant. It says it must return "an ordered sequence of all the directory entries" which it apparently does, and return a NULL at the end, which it apparently does. Ben's point about the client is a good one though. What is it? [1] http://www.opengroup.org/onlinepubs/007904975/functions/readdir.html
Created attachment 319 [details] Fixes the problem The client used was OpenSSH 3.6.1ps SFTP, though I expect the problem would happen with any client. The bug is triggered because each time process_readdir is called, it collects up to 100 entries from readdir, or until readdir returns NULL. If it collected any entries at all, it doesn't remember which case it encountered. So it will get called again to return more entries, so it calls readdir again, which can start returning the entire directory listing again. And so on, and so on.
Could you please attach that in unified "diff -u" format? These are much easier to read.
Sorry, no -u option on Tru64 diff: diff -u sftp-server.c* diff: illegal option -- u Usage: diff [-bitw] [-c |-C Lines|-D String|-e|-f|-h|-n] File1 File2 diff [-bilrstw] [-c |-C Lines|-e|-f|-h|-n] [-S File] Directory1 Directory2
Try diff -c then
There are two aspects to this bug. 1. Tru64's kernel is broken. =) After readdir() returns NULL the first time it should return NULL every other time. Something Tru64 does not do when it is access files via NFS. 2. sftp-server depends on this behavior. the Posix definition just states "the end of the directory is encountered, a null pointer shall be returned and errno is not changed." which to me implies it should not rewind() the DIR* structure. Which is what seems to be happening. You need to really report this bug to Compaq/HP. Before I consider any work arounds I would perfer to know that this has not been resolved in a service pack.
Created attachment 320 [details] Fixes the problem OK, here's a diff -u done from a Linux box.
Which way is Tru64's kernel "broken": 1) It doesn't do it the way Linux does (true enough), or 2) It doesn't conform to the official spec (not so clear). Darren Tucker has previously referenced the spec at <http:// www.opengroup.org/onlinepubs/007904975/functions/readdir.html>. The relevant paragraph is the following: The readdir() function shall return a pointer to a structure representing the directory entry at the current position in the directory stream specified by the argument dirp, and position the directory stream at the next entry. It shall return a null pointer upon reaching the end of the directory stream. The structure dirent defined in the <dirent.h> header describes a directory entry. Now, what does this spec say about what happens when you call readdir again _after_ it has returned a null pointer? It's not clear at all. There doesn't seem to be any prerequisite attached to the post-condition " ...and position the directory stream at the next entry". So what is the "next" entry after the end of the directory stream? The Tru64 interpretation (start over from the beginning) seems as reasonable as anything else. Remember the robustness principle <ftp://ftp.isi.edu/in-notes/rfc791.txt>: "be liberal in what you accept, conservative in what you send."
Have you checked whether this is a reported bug with DEC/Compaq/HP? This *is* a bug - Neither BSD, Linux, Sun nor even Tru64 for non-NFS mounts behave this way. While SUSv3 and Darren's reference don't mention what happens after the initial NULL is returned, the behaviour to rewinddir() back to the beginning of the directory is so utterly unexpected and bogus that it should not need to be specified against. I'd consider including a workaround if you can point to system documentation describing this behaviour as being intended or if DEC/Compaq/HP don't accept this as a bug.
no followup == no bug
If you are claiming to offer "portable" code, then it is not really acceptable to use certain vendors' implementations as having definitive semantics, and others as not. You should go by the specification. And if the specification is unclear, you should not rely on that behaviour.
You have not provided one iota of evidence that this behaviour is not an OS bug. Given that a single platform is the only one exhibiting the problem, and then only for one specific file system, then it is almost certainly an OS bug. I'm willing be be convinced otherwise but, as I mentioned two months ago, the onus is on you to provide evidence. Please don't reopen the bug unless you are willing to do so.
>You have not provided one iota of evidence that this behaviour is >not an OS bug. We have given you the specification. Is that not the right specification? Is there some more official specification somewhere else that we're not aware of? You have not provided one iota of evidence that the Tru64/NFS behaviour is contrary to the specification. Your code does not work correctly with this behaviour. Therefore it is your code that has the bug.
Mass change of RESOLVED bugs to CLOSED