By default, umac.c is compiled with -O2, which performs well on x86 and x86_64. On other architectures, however, performance can be improved by adding -funroll-loops. Using Ted Krovetz's original code, the performance for 1KB blocks on various architectures (in clocks per byte) is as follows:

             gcc -O2    gcc -O2 -funroll-loops
    x86_64:   0.95       1.04
    IA64:     2.31       1.36
    SPARC:    9.52       9.50
    POWER5:   3.88       3.67

The architecture that benefits the most is IA64. A memory-to-memory test using ssh on a 1.5 GHz Itanium system shows an improvement of approximately 9 MB/s: 128 MB/s with just -O2 and 137 MB/s when -funroll-loops is added.

It may be worthwhile adding the following to Makefile.in:

umac.o: umac.c
	$(CC) $(CFLAGS) -funroll-loops $(CPPFLAGS) -c $?

Admittedly, this would be slightly detrimental on x86_64, and for architectures other than IA64 the benefit appears to be marginal.
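To make the clocks-per-byte figures quoted above easier to compare, here is a small sketch that turns them into relative speedups. The numbers are copied from the comment; the names `CYCLES_PER_BYTE`, `speedup` and `summarize` are made up for illustration and are not part of umac.c or the build.

```python
# Clocks per byte for 1KB blocks, copied from the figures above:
# architecture -> (gcc -O2, gcc -O2 -funroll-loops)
CYCLES_PER_BYTE = {
    "x86_64": (0.95, 1.04),
    "IA64": (2.31, 1.36),
    "SPARC": (9.52, 9.50),
    "POWER5": (3.88, 3.67),
}


def speedup(baseline, unrolled):
    """Return how many times faster the unrolled build is (>1 means faster)."""
    return baseline / unrolled


def summarize(table):
    """Map each architecture to its speedup, rounded to two decimals."""
    return {
        arch: round(speedup(o2, unrolled), 2)
        for arch, (o2, unrolled) in table.items()
    }


if __name__ == "__main__":
    for arch, ratio in summarize(CYCLES_PER_BYTE).items():
        print(f"{arch:8s} {ratio:.2f}x")
```

On these figures IA64 comes out at roughly 1.70x, POWER5 at about 1.06x, SPARC is essentially unchanged, and x86_64 drops to about 0.91x, i.e. it gets slower.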
(In reply to comment #0)
> umac.o: umac.c
> 	$(CC) $(CFLAGS) -funroll-loops $(CPPFLAGS) -c $?

Well, the first problem with that is that not all compilers understand -funroll-loops, so it'll stop it compiling with anything that's not gcc (or pretending to be gcc). Can you just do "./configure --with-cflags=-funroll-loops"? Or does that have a detrimental impact elsewhere?
Ah, yes, I hadn't considered the (lack of) portability of -funroll-loops. I haven't seen any detrimental effects from adding it for the entire build process, but thought it might be better to take a more surgical approach. Using --with-cflags should do the trick. I've been using this performance tweak on Itanium for the past two and thought that it was about time to pass on the observation to the community. I have to admit that I almost didn't submit this bug when I saw the marginal benefit on other architectures.
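For anyone scripting their builds, the portability concern could be handled by probing the compiler before passing the flag to configure. The following is only a sketch under stated assumptions: `compiler_accepts_flag` and `configure_args` are hypothetical helpers, not anything in the OpenSSH tree, and the probe assumes the compiler takes `-c` and `-o` in the usual way. A compiler that merely warns about an unknown option and still exits 0 would pass this check, so treat the result as a hint.

```python
import os
import shutil
import subprocess
import tempfile


def compiler_accepts_flag(cc, flag):
    """Return True if `cc` compiles a trivial C file with `flag` and exits 0."""
    if shutil.which(cc) is None:
        return False  # compiler not installed or not on PATH
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "probe.c")
        obj = os.path.join(tmp, "probe.o")
        with open(src, "w") as f:
            f.write("int main(void) { return 0; }\n")
        result = subprocess.run(
            [cc, flag, "-c", src, "-o", obj],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0


def configure_args(cc, flag="-funroll-loops"):
    """Build a ./configure command line, adding the flag only when accepted."""
    args = ["./configure"]
    if compiler_accepts_flag(cc, flag):
        args.append("--with-cflags=" + flag)
    return args


if __name__ == "__main__":
    # "cc" is an assumption; substitute whichever compiler the build uses.
    print(" ".join(configure_args("cc")))
```

With gcc this would be expected to print the --with-cflags form suggested in comment #1; with a compiler that rejects the flag it falls back to a plain ./configure.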
As far as POWER goes (more specifically, AIX and xlc), there is a PDF describing the optimization 'process' when using vac/xlc as the compiler. The document number for VAC v11 is SC27-2478-00 (or higher for the last two digits, which are the revision number). Here is a starting point for documentation: http://www-01.ibm.com/support/knowledgecenter/SSGH2K_11.1.0/com.ibm.xlc111.aix.doc/conventions/compiler_pubs.html

On (printed) page 47 (Chapter 7. Optimizing your applications), the path from -O2 to higher degrees of optimization is discussed. Pages 48 and 49 discuss -O3, -O4 and -O5, and the bottom of page 49 discusses adding only a variation of the -qhot option to act on loop transformations.

In summary, -funroll-loops is not a vac/xlc flag I am aware of, but there is documentation to be had to help set up your own customization. imho, anything beyond -O2 needs care. At least for xlc/vac, unless you specify -qstrict with -O3 and above, you are permitting the compiler to reorder code (blocks).
Marking wontfix: history has passed IA64 by and umac is not too far behind it
Close all resolved bugs after 7.3p1 release