Bug 1753 - Use -funroll-loops with umac.c
Summary: Use -funroll-loops with umac.c
Status: CLOSED WONTFIX
Alias: None
Product: Portable OpenSSH
Classification: Unclassified
Component: Build system
Version: -current
Hardware: Itanium Other
Importance: P2 enhancement
Assignee: Assigned to nobody
Reported: 2010-04-10 08:18 AEST by Iain Morgan
Modified: 2016-08-02 10:41 AEST
Description Iain Morgan 2010-04-10 08:18:42 AEST
By default, umac.c is compiled with -O2 and performs well on x86 and
x86_64 architectures. However, on other architectures the performance
can be improved by adding -funroll-loops.

Using Ted Krovetz's original code, the performance for 1KB blocks on
various architectures (clocks per byte) is as follows:

        gcc -O2         gcc -O2 -funroll-loops
x86_64: 0.95            1.04
IA64:   2.31            1.36
SPARC:  9.52            9.50
POWER5: 3.88            3.67

The architecture that benefits the most from this is IA64. A
memory-to-memory test using ssh on a 1.5 GHz Itanium system shows an
improvement of approximately 9 MB/s: 128 MB/s with just -O2 and 137 MB/s
when -funroll-loops is added.
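
As a rough illustration of the figures above, the relative change per
architecture can be expressed as the -O2 figure divided by the
-funroll-loops figure, so values above 1 mean -funroll-loops is faster.
The speedup helper below is a hypothetical name introduced for this
sketch; its inputs are the clocks-per-byte numbers from the table.

```shell
#!/bin/sh
# speedup BASE UNROLLED
# Hypothetical helper: prints BASE / UNROLLED to two decimal places.
# Both arguments are clocks per byte, so a result above 1 means the
# -funroll-loops build needs fewer clocks per byte.
speedup() {
    awk -v base="$1" -v unrolled="$2" 'BEGIN { printf "%.2f\n", base / unrolled }'
}

# Clocks-per-byte figures copied from the table in this report.
for row in "x86_64 0.95 1.04" "IA64 2.31 1.36" "SPARC 9.52 9.50" "POWER5 3.88 3.67"; do
    set -- $row    # split the row into architecture, -O2, -funroll-loops
    printf '%-7s %s\n' "$1" "$(speedup "$2" "$3")"
done
```

The same helper can be applied to the throughput numbers, with the
arguments swapped because higher MB/s is better: speedup 137 128.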

It may be worthwhile adding the following to Makefile.in:

umac.o: umac.c
        $(CC) $(CFLAGS) -funroll-loops $(CPPFLAGS) -c $?

Admittedly, this would be slightly detrimental to x86_64, and for
architectures other than IA64 the benefit appears to be marginal.
Comment 1 Darren Tucker 2010-04-10 10:25:36 AEST
(In reply to comment #0)
> umac.o: umac.c
>         $(CC) $(CFLAGS) -funroll-loops $(CPPFLAGS) -c $?

Well, the first problem with that is that not all compilers understand -funroll-loops, so it'll stop it compiling with anything that's not gcc (or pretending to be gcc).

Can you just do "./configure --with-cflags=-funroll-loops"? Or does that have a detrimental impact elsewhere?
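
One way to guard against compilers that reject the flag is to probe for it before passing it to configure. The following is a minimal sketch, not something OpenSSH's configure script is known to do: cc_accepts_flag is a hypothetical helper name, and the probe assumes the compiler exits non-zero on an unknown option (some compilers only print a warning, which this check would not catch). It also assumes mktemp -d is available.

```shell
#!/bin/sh
# cc_accepts_flag COMPILER FLAG
# Hypothetical helper: returns 0 if COMPILER compiles a trivial C file
# when given FLAG, non-zero otherwise.
cc_accepts_flag() {
    cc=$1
    flag=$2
    tmpdir=$(mktemp -d) || return 1
    printf 'int main(void) { return 0; }\n' > "$tmpdir/probe.c"
    "$cc" "$flag" -c "$tmpdir/probe.c" -o "$tmpdir/probe.o" >/dev/null 2>&1
    status=$?
    rm -rf "$tmpdir"
    return $status
}

# Use the compiler from the environment, falling back to cc.
CC=${CC:-cc}

# Print the configure invocation that matches what the compiler accepts.
if cc_accepts_flag "$CC" -funroll-loops; then
    echo "./configure --with-cflags=-funroll-loops"
else
    echo "./configure"
fi
```

The script only prints the suggested command rather than running it, so it can be tried outside a source tree.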
Comment 2 Iain Morgan 2010-04-10 11:22:29 AEST
Ah, yes, I hadn't considered the (lack of) portability with
-funroll-loops. I haven't seen any detrimental effects from adding it
for the entire build process, but thought it might be better to take a
more surgical approach. Using --with-cflags should do the trick.

I've been using this performance tweak on Itanium for the past two
and thought that it was about time to pass on the observation to the
community. I have to admit that I almost didn't submit this bug when I
saw the marginal benefit on other architectures.
Comment 3 Michael Felt 2015-06-07 07:31:49 AEST
As far as POWER goes (more specifically, AIX and xlc) there is a PDF describing the optimization 'process' when using vac/xlc as a compiler.
The document # for VAC-v11 is SC27-2478-00 (or higher for the last two digits - the revision number).

Here is a starting point for documentation: http://www-01.ibm.com/support/knowledgecenter/SSGH2K_11.1.0/com.ibm.xlc111.aix.doc/conventions/compiler_pubs.html

On (printed) page 47 (Chapter 7. Optimizing your applications) the path of moving from -O2 to higher degrees of optimization is discussed.

Pages 48 and 49 discuss -O3, -O4 and -O5, and the bottom of page 49 discusses adding only a variation of the -qhot option to act on loop transformations.

In summary, -funroll-loops is not a vac/xlc flag I am aware of, but there is documentation to be had to help set up your own customization. imho, anything beyond -O2 needs care. At least for xlc/vac, unless you specify -qstrict with -O3 and above you are permitting the compiler to reorder code (blocks).
Comment 4 Damien Miller 2015-11-16 12:20:34 AEDT
Marking wontfix: history has passed IA64 by and umac is not too far behind it
Comment 5 Damien Miller 2016-08-02 10:41:51 AEST
Close all resolved bugs after 7.3p1 release