Bug 3475 - clang-15 amd64 ED25519 signature verification nondeterministic spurious failure
Summary: clang-15 amd64 ED25519 signature verification nondeterministic spurious failure
Status: CLOSED FIXED
Alias: None
Product: Portable OpenSSH
Classification: Unclassified
Component: ssh (show other bugs)
Version: 9.0p1
Hardware: amd64 Linux
: P5 minor
Assignee: Assigned to nobody
URL:
Keywords:
Depends on:
Blocks: V_9_2
  Show dependency treegraph
 
Reported: 2022-09-25 04:22 AEST by Daniel Pouzzner
Modified: 2023-03-17 13:42 AEDT (History)
3 users (show)

See Also:


Attachments
patch with a kludge fix (616 bytes, patch)
2022-09-25 04:22 AEST, Daniel Pouzzner
no flags Details | Diff
Avoid fzero-call-used-regs on possibly buggy clang versions (1.85 KB, patch)
2022-11-30 10:43 AEDT, Darren Tucker
djm: ok+
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Daniel Pouzzner 2022-09-25 04:22:20 AEST
Created attachment 3614 [details]
patch with a kludge fix

An update to net-misc/openssh-9.0_p1-r6 (Gentoo port) this morning led immediately to nondeterministic (~50% failure rate) connection failures, both in ssh and sshd, with this message:

ssh_dispatch_run_fatal: Connection to a.b.c.d port 22: incorrect signature

It quickly became apparent that the problem only affected connections authenticated with ED25519.

I found, by instrumenting ed25519.c:crypto_sign_ed25519_open(), that bit #256 (of 255 valid bits) in sm[] was nondeterministically set or clear, whereas the corresponding bit in t2[] was always clear.

I strongly suspect the new behavior is due to new compiler behavior, because (1) reverting to openssh-8.9_p1 fixed the problem even though all ED source code is identical, and (2) reverting to a rebuilt openssh-9.0_p1-r4 (Gentoo port revision again), which was functioning flawlessly before the rebuild, did not fix the problem.

The compiler version for the openssh-9.0_p1-r4 that worked right was gcc-11.3.0, versus currently installed gcc-11.3.1_p20220909.

For now I've kludged a fix for myself that just copies the spurious 256th bit from sm[31] to t2[31].  See attached patch.  This is not the right way to fix the problem of course.
Comment 1 Daniel Pouzzner 2022-09-26 01:53:58 AEST
This glitch turns out to have been caused by building with the llvm-15.0.1+clang-15.0.1 toolchain (by accident/Gentoo portage/package bug).  It's highly repeatable, and is either a bug in the compiler/toolchain, or a bug somewhere in openssh around the handling of ED25519 signature blobs and whatnot.

The ssh and sshd built under clang-15 cause the "incorrect signature" failure for ED25519 on incoming connections to ED25519-hostkeyed sshd, and outgoing connections to e.g. github.com, which use ED25519 hostkeys.

Building under gcc-11.3.1 resolves all ED25519 problems, inbound and outbound.
Comment 2 Darren Tucker 2022-09-28 06:01:01 AEST
I have repro'd on Fedora with a clang-15.0.1 built from git.

So far I've found that for some builds, "make t-exec LTESTS=connect" will reliably reproduce the error for me.  In some cases, deleting ssh-ed25519.o and rebuilding will result in a binary that does not exhibit the problem (and such a binary will run the regress suite (at least until I got bored).  To repro:

$ CC=/opt/clang-15.0.1/bin/clang ../../configure
$ for i in `seq 0 9`; do make clean >/dev/null 2>&1; make -j24 >/dev/null 2>&1; for j in `seq 0 9`; do if make t-exec LTESTS=connect >/dev/null 2>&1; then echo -n "good "; else echo -n "bad "; fi;done; echo; done
bad bad bad bad bad bad bad bad bad bad 
bad bad bad bad bad bad bad bad bad bad 
bad bad bad bad bad bad bad bad bad bad 
good good good good good good good good good good 
bad bad bad bad bad bad bad bad bad bad 
bad bad bad bad bad bad bad bad bad bad 
good good good good good good good good good good 
bad bad bad bad bad bad bad bad bad bad 
bad bad bad bad bad bad bad bad bad bad 
good good good good good good good good good good

Now if I remove -fzero-call-used-regs=all from the Makefile with no other changes:

$ for i in `seq 0 9`; do make clean >/dev/null 2>&1; make -j24 >/dev/null 2>&1; for j in `seq 0 9`; do if make t-exec LTESTS=connect >/dev/null 2>&1; then echo -n "good "; else echo -n "bad "; fi;done; echo; done
good good good good good good good good good good 
good good good good good good good good good good 
good good good good good good good good good good 
good good good good good good good good good good 
good good good good good good good good good good 
good good good good good good good good good good 
good good good good good good good good good good 
good good good good good good good good good good 
good good good good good good good good good good 
good good good good good good good good good good 

To me, this smells like clang 15.0.1 has some compile-time undefined behaviour in -fzero-call-used-regs.
Comment 3 Damien Miller 2022-10-14 17:29:34 AEDT
I just ran into this in another context, so we need to get a minified testcase we can submit to clang as well as taking some action in configure.ac; either removing these flags entirely or turning them off for broken compiler versions
Comment 4 Bill Wendling 2022-11-29 12:38:16 AEDT
I created https://github.com/llvm/llvm-project/issues/59242 to address this issue.
Comment 5 Bill Wendling 2022-11-30 06:38:59 AEDT
It looks like Clang 15.0.6 was just release. There's a fix in it for a similar OpenSSH bug. Could you give it a try please?

https://releases.llvm.org/
Comment 6 Darren Tucker 2022-11-30 10:18:40 AEDT
15.0.6 seems to have the same problem.

$ /opt/clang-15.0.6/bin/clang --version
clang version 15.0.6 (https://github.com/llvm/llvm-project.git 088f33605d8a61ff519c580a71b1dd57d16a03f8)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/clang-15.0.6/bin

$ CC=/opt/clang-15.0.6/bin/clang ../../configure

$ for i in `seq 0 9`; do make clean >/dev/null 2>&1; make -j24 >/dev/null 2>&1; for j in `seq 0 9`; do if make t-exec LTESTS=connect >/dev/null 2>&1; then echo -n "good "; else echo -n "bad "; fi;done; echo; done
bad bad bad bad bad bad bad bad bad bad 
good good good good good good good good good good 
good good good good good good good good good good 
bad bad bad bad bad bad bad bad bad bad 
good good good good good good good good good good 
good good good good good good good good good good 
good good good good good good good good good good 
good good good good good good good good good good 
bad bad bad bad bad bad bad bad bad bad 
good good good good good good good good good good 

$ tail -4 regress/failed-ssh.log 
debug2: ssh_ed25519_verify: crypto_sign_ed25519_open failed: -1
ssh_dispatch_run_fatal: Connection to UNKNOWN port 65535: incorrect signature
FAIL: ssh proxycommand connect failed
Comment 7 Darren Tucker 2022-11-30 10:43:35 AEDT
Created attachment 3628 [details]
Avoid fzero-call-used-regs on possibly buggy clang versions

Given that the problem seems to occur with several released clang versions I think we should avoid the problematic flag for all versions of clang-15.   We can add more specific rules if we later learn of versions that work.
Comment 8 Bill Wendling 2022-11-30 11:27:35 AEDT
Until we have a fix, you might want to use `-fzero-call-used-regs=used` instead of `all`. The latter is a huge hammer and could zero out registers that are still in use up the call chain.
Comment 9 Darren Tucker 2022-11-30 12:01:43 AEDT
Thanks, that also passes the test above.  I've changed the diff to pass "=used" instead and "=all" if clang 15 is detected and committed it as https://github.com/openssh/openssh-portable/commit/62cc33e6eed847aafdc29e34aa69e9bd82a0ee16.
Comment 10 Bill Wendling 2022-12-13 11:47:56 AEDT
Should be fixed by https://reviews.llvm.org/D139679. I'll see if we can backport it to 5.0.7.
Comment 11 Bill Wendling 2022-12-14 10:11:38 AEDT
Should be fixed by https://reviews.llvm.org/D139679. I'll see if we can backport it to 5.0.7.
Comment 12 Darren Tucker 2022-12-16 14:34:52 AEDT
I have cherry-picked the configure.ac change into the V_9_1 branch too, so if we do a 9.1p2 release it'll be in that too.

If the clange fix makes it into clang 15 releases and it's important enough we can explicitly enable it on those, otherwise we'll just leave it until clang eventually moves to v16.

Thanks all.
Comment 13 Damien Miller 2023-03-17 13:42:08 AEDT
OpenSSH 9.3 has been released. Close resolved bugs