in chemistry at the petascale level and beyond
A. Scemama (1), M. Caffarel (1), E. Oseret (2), W. Jalby (2)
(1) Laboratoire de Chimie et Physique Quantiques / IRSAMC, Toulouse, France
(2) Exascale Computing Research / Intel, CEA, GENCI, UVSQ Versailles, France
28 June 2012
A. Scemama, M. Caffarel, E. Oseret, W. Jalby QMC simulations in chemistry
Solve the Schrödinger equation with random walks
- State-of-the-art and routine approaches in physics: nuclear physics, condensed matter, spin systems, quantum liquids, infrared spectroscopy, ...
- Still rarely used for the electronic structure problem of quantum chemistry (as opposed to post-HF and DFT)
- Reason: very high computational cost for small/medium systems
- But: very favorable scaling with system size compared to standard methods
- Ideally suited to extreme parallelism
vector of $\mathbb{R}^{3N}$ containing the electron positions
- Drifted diffusion of walkers with a birth/death process to generate the 3N-density $\Psi \times \Phi$ (needs $\Psi$, $\nabla\Psi$, $\Delta\Psi$)
- Compute $\frac{H\Psi(r_1,\dots,r_N)}{\Psi(r_1,\dots,r_N)}$ for all positions
- The energy is the average of all computed $\frac{H\Psi(r_1,\dots,r_N)}{\Psi(r_1,\dots,r_N)}$
- Extreme parallelism: independent populations of walkers can be distributed on different CPUs
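The sampling loop described above can be illustrated on a toy problem. The following is a minimal sketch, not the authors' code: it assumes a single electron in a hydrogen atom with the trial function Ψ = exp(-a|r|) in atomic units, samples Ψ² by drifted diffusion with a Metropolis acceptance step, and omits the birth/death process entirely (so it is a variational rather than a diffusion Monte Carlo run). The function names, the time step `tau` and the walker counts are illustrative choices.

```python
import numpy as np

def local_energy(r, a):
    """H psi / psi for psi = exp(-a|r|) in the hydrogen atom (atomic units)."""
    dist = np.linalg.norm(r, axis=-1)
    return -0.5 * a**2 + (a - 1.0) / dist

def drift(r, a):
    """Drift vector grad(psi)/psi = -a r/|r|."""
    dist = np.linalg.norm(r, axis=-1, keepdims=True)
    return -a * r / dist

def log_psi_squared(r, a):
    """Logarithm of the sampled density psi^2."""
    return -2.0 * a * np.linalg.norm(r, axis=-1)

def average_energy(a, n_walk=200, n_step=500, tau=0.05, seed=0):
    """Average H psi / psi over a population of walkers (no birth/death)."""
    rng = np.random.default_rng(seed)
    r = rng.normal(size=(n_walk, 3))          # one walker = one electron position
    step_averages = []
    for _ in range(n_step):
        # Drifted diffusion move: deterministic drift plus Gaussian noise.
        r_new = r + tau * drift(r, a) + np.sqrt(tau) * rng.normal(size=r.shape)
        # Metropolis-Hastings correction for the finite time step.
        forward = r_new - r - tau * drift(r, a)
        backward = r - r_new - tau * drift(r_new, a)
        log_ratio = (log_psi_squared(r_new, a) - log_psi_squared(r, a)
                     + (np.sum(forward**2, axis=-1)
                        - np.sum(backward**2, axis=-1)) / (2.0 * tau))
        accept = np.log(rng.random(n_walk)) < log_ratio
        r = np.where(accept[:, None], r_new, r)
        step_averages.append(local_energy(r, a).mean())
    return float(np.mean(step_averages))
```

With `a = 1.0` the trial function is the exact ground state, so every sampled value of HΨ/Ψ equals -0.5 hartree and the average carries no statistical noise; other values of `a` give a noisy estimate above that value.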
: Nwalk walkers executing Nstep steps
- Compute as many blocks as possible, as quickly as possible
- Block averages have a Gaussian distribution
[Figure: blocks arranged along the axes Nstep, Nwalk, Nproc and CPU time]
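Because block averages are Gaussian, they can be combined with elementary statistics. The sketch below shows one way to do this, under the assumption that blocks are independent; `run_block` is a hypothetical stand-in that draws Gaussian numbers in place of real local energies, and the values -0.5 and 0.2 are arbitrary.

```python
import math
import random

def run_block(n_walk, n_step, rng):
    """One block: n_walk walkers each take n_step steps; return the block average.
    A Gaussian draw stands in for the quantity sampled at each step."""
    n_samples = n_walk * n_step
    total = 0.0
    for _ in range(n_samples):
        total += rng.gauss(-0.5, 0.2)
    return total / n_samples

def combine_blocks(block_averages):
    """Mean of the block averages and the standard error of that mean."""
    n = len(block_averages)
    mean = sum(block_averages) / n
    variance = sum((b - mean) ** 2 for b in block_averages) / (n - 1)
    return mean, math.sqrt(variance / n)

rng = random.Random(42)
blocks = [run_block(n_walk=10, n_step=20, rng=rng) for _ in range(50)]
energy, error = combine_blocks(blocks)
```

The standard error shrinks as the inverse square root of the number of blocks, which is why producing as many blocks as possible is the goal.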
I/O and network communications are asynchronous
[Diagram components: Master compute node (Data Server, Manager, Database, Main worker thread, Network Thread, I/O Thread); Slave compute node; Forwarder; Worker]
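The idea of keeping communication off the compute path can be sketched with a queue between worker threads and a forwarder thread. This is only an illustration of the pattern, not the architecture of the actual program: the names `worker`, `forwarder` and `run_node`, the use of Python threads, and the list standing in for the database are all assumptions.

```python
import queue
import random
import threading

def worker(worker_id, n_blocks, out_queue):
    """Compute blocks and hand them off without waiting for any I/O."""
    rng = random.Random(worker_id)
    for _ in range(n_blocks):
        block_average = sum(rng.gauss(0.0, 1.0) for _ in range(1000)) / 1000
        out_queue.put((worker_id, block_average))

def forwarder(in_queue, database):
    """Drain the queue and store results; None is the shutdown signal."""
    while True:
        item = in_queue.get()
        if item is None:
            break
        database.append(item)   # stand-in for sending to a data server

def run_node(n_workers=4, n_blocks=5):
    """Run workers and one forwarder; return everything the forwarder stored."""
    results = queue.Queue()
    database = []
    fwd = threading.Thread(target=forwarder, args=(results, database))
    fwd.start()
    workers = [threading.Thread(target=worker, args=(i, n_blocks, results))
               for i in range(n_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    results.put(None)           # all blocks are queued; stop the forwarder
    fwd.join()
    return database
```

Workers only ever call `put`, so a slow network or disk delays the forwarder, not the computation.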
possible system failures
- Blocks are Gaussian → losing blocks doesn't change the average
- Simulation survives the removal of any node
- Restart always possible from the database
[Diagram: a Database connected to Data Servers, each linked to many Forwarders]
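The claim that losing blocks does not change the average can be checked numerically. The sketch below assumes independent Gaussian blocks with arbitrary mean and width, and a hypothetical run of 16 nodes in which two nodes fail; none of these numbers come from the talk.

```python
import math
import random

def mean_and_error(blocks):
    """Mean of the blocks and the standard error of that mean."""
    n = len(blocks)
    mean = sum(blocks) / n
    variance = sum((b - mean) ** 2 for b in blocks) / (n - 1)
    return mean, math.sqrt(variance / n)

def simulate_node_failure(n_nodes=16, blocks_per_node=50,
                          failed_nodes=(3, 7), seed=1):
    """Compare the estimate from all blocks with the one after losing nodes."""
    rng = random.Random(seed)
    blocks = {node: [rng.gauss(-0.5, 0.01) for _ in range(blocks_per_node)]
              for node in range(n_nodes)}
    all_blocks = [b for node_blocks in blocks.values() for b in node_blocks]
    surviving = [b for node, node_blocks in blocks.items()
                 if node not in failed_nodes for b in node_blocks]
    return mean_and_error(all_blocks), mean_and_error(surviving)

(full_mean, full_err), (kept_mean, kept_err) = simulate_node_failure()
```

The two means agree within the error bar; the only cost of the failure is a slightly larger statistical error from having fewer blocks on average.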
Carlo step
- Matrix inversion: O(N³) (DP, Intel MKL)
- Sparse × dense matrix products: O(N²) (SP, our implementation)
Efficiency of the matrix products:
- Static analysis (MAQAO): full AVX (no scalar operations), innermost loops perform 16 flops/cycle
- Decremental analysis (DECAN): good balance between flops and memory operations
- Up to 64% of the peak measured on Xeon E5
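The mixed-precision split can be illustrated with NumPy: a sparse × dense product carried out in single precision, followed by an inversion in double precision. This is a sketch under assumptions, not the authors' vectorized kernel: the compressed-row layout, the function names, and which matrix is sparse are illustrative, and `numpy.linalg.inv` stands in for the MKL routine.

```python
import numpy as np

def to_csr(a):
    """Convert a dense 2-D array into (values, column indices, row pointers)."""
    values, cols, rowptr = [], [], [0]
    for row in a:
        nonzero = np.nonzero(row)[0]
        values.extend(row[nonzero])
        cols.extend(nonzero)
        rowptr.append(len(values))
    return (np.array(values, dtype=a.dtype),
            np.array(cols, dtype=np.int64),
            np.array(rowptr, dtype=np.int64))

def csr_times_dense(values, cols, rowptr, b):
    """Sparse (CSR) times dense product; only stored nonzeros are touched."""
    n_rows = len(rowptr) - 1
    out = np.zeros((n_rows, b.shape[1]), dtype=b.dtype)
    for i in range(n_rows):
        lo, hi = rowptr[i], rowptr[i + 1]
        # Row i of the result is a weighted sum of the rows of b selected by cols.
        out[i] = values[lo:hi] @ b[cols[lo:hi]]
    return out

def invert_double(c):
    """Promote to double precision before the O(N^3) inversion."""
    return np.linalg.inv(c.astype(np.float64))
```

The product only visits stored nonzeros, which is what makes it cheaper than a dense product, while the inversion is done in double precision because it is more sensitive to rounding.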
on Curie
First step in our scientific project:
- All-electron calculation of the energy difference between the β-strand and the α-helix conformations of amyloid peptide Aβ(28-35)
- 122 atoms, 434 electrons, cc-pVTZ basis set (2960 basis functions)
on Curie
Scientific results (cc-pVTZ basis set):
- Standard DFT (B3LYP): 10.7 kcal/mol
- DFT with empirical corrections (SSB-D): 35.8 kcal/mol
- All-electron MP2: 39.3 kcal/mol
- CCSD(T) would require at least 100 million CPU hours
- QMC in < 2 million CPU hours (1 day): 39.7 ± 2 kcal/mol
- QMC calculations can be made on these systems → study of the interaction of copper ions with β-amyloids
Technological results:
- Sustained 960 TFlops/s (mixed SP/DP) on 76,800 cores of Curie
- ~80% parallel speed-up (today it would be > 95%: run termination was optimized)
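The headline figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted on the slide; it assumes that "1 day" means about 24 hours of wall-clock time on all 76,800 cores and that the ± 2 kcal/mol is one standard deviation. The function names are illustrative.

```python
CORES = 76_800
SUSTAINED_FLOPS = 960e12          # 960 TFlops/s, mixed SP/DP
QMC_CPU_HOURS_BOUND = 2e6         # "< 2 million CPU hours"
CCSDT_CPU_HOURS_BOUND = 100e6     # "at least 100 million CPU hours"
QMC_ENERGY, QMC_ERROR = 39.7, 2.0 # kcal/mol

def per_core_gflops():
    """Sustained throughput per core, in GFlops/s."""
    return SUSTAINED_FLOPS / CORES / 1e9

def core_hours(wall_hours):
    """CPU hours consumed when every core runs for wall_hours."""
    return CORES * wall_hours

def deviation_in_error_bars(value):
    """Distance from the QMC result, in units of the QMC error bar."""
    return abs(value - QMC_ENERGY) / QMC_ERROR

print(per_core_gflops())                             # 12.5 GFlops/s per core
print(core_hours(24))                                # 1843200 CPU hours
print(CCSDT_CPU_HOURS_BOUND / QMC_CPU_HOURS_BOUND)   # 50.0
```

Under these assumptions, one day on the full machine is about 1.84 million CPU hours, consistent with the quoted bound, and the CCSD(T) estimate is at least 50 times larger. MP2 (39.3) lies 0.2 error bars from the QMC value and SSB-D (35.8) about 2, while B3LYP (10.7) is 14.5 error bars away.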