small prefactor.

Innermost loops, analyzed with MAQAO:
• Perfect ADD/MUL balance
• Does not saturate load/store units
• Only vector operations, with no peel/tail loops
• Uses 15 AVX registers, with no register spilling
• If all data fits in L1, 100% of peak is reached (16 flops/cycle)
• In practice: memory bound, so 50-60% of peak is measured

MAQAO: Modular Assembler Quality Analyzer and Optimizer for Itanium 2, L. Djoudi, D. Barthou, P. Carribault, C. Lemuet, A.-T. Acquaviva, and W. Jalby, Workshop on EPIC Architectures and Compiler Technology, San Jose (2005).
QMC for large chemical systems: Implementing efficient strategies for petascale platforms and beyond, A. Scemama, M. Caffarel, E. Oseret, W. Jalby, J. Comput. Chem. 34(11), 938-951 (2013).
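
The 16 flops/cycle peak and the 50-60% measured fraction quoted for the innermost loops can be related by a short calculation. The sketch below is illustrative only. It assumes 256-bit AVX registers, single-precision (32-bit) elements, and one vector ADD plus one vector MUL issued per cycle. None of these are stated on the slide, though together they reproduce its 16 flops/cycle figure. The function and constant names are hypothetical.

```python
# Assumptions (not stated in the source): 256-bit AVX registers,
# 32-bit single-precision elements, and one vector ADD plus one
# vector MUL issued per cycle (the "perfect ADD/MUL balance" case).
AVX_REGISTER_BITS = 256
ELEMENT_BITS = 32
VECTOR_OPS_PER_CYCLE = 2  # one ADD + one MUL


def peak_flops_per_cycle(register_bits=AVX_REGISTER_BITS,
                         element_bits=ELEMENT_BITS,
                         vector_ops_per_cycle=VECTOR_OPS_PER_CYCLE):
    """Peak flops per cycle when every issued operation is a full vector op."""
    lanes = register_bits // element_bits  # elements per vector register
    return lanes * vector_ops_per_cycle


def sustained_flops_per_cycle(peak, fraction_of_peak):
    """Flops per cycle actually achieved at a given fraction of peak."""
    return peak * fraction_of_peak


peak = peak_flops_per_cycle()
low = sustained_flops_per_cycle(peak, 0.50)   # lower end of the measured range
high = sustained_flops_per_cycle(peak, 0.60)  # upper end of the measured range

print(f"peak: {peak} flops/cycle")
print(f"memory-bound range: {low:.1f} to {high:.1f} flops/cycle")
```

Under these assumptions, the memory-bound regime corresponds to roughly 8 to 10 flops per cycle, instead of the 16 available when all data fits in L1.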