Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Software optimization for petaflops/s scale Qua...

Software optimization for petaflops/s scale Quantum Monte Carlo simulations

Anthony Scemama

December 04, 2012
Tweet

More Decks by Anthony Scemama

Other Decks in Programming

Transcript

  1. logos Quantum Monte Carlo The QMC=Chem code Software optimization for

    petaflops/s scale Quantum Monte Carlo simulations A. Scemama1, M. Caffarel1, E. Oseret2, W. Jalby2 1Laboratoire de Chimie et Physique Quantiques / IRSAMC, Toulouse, France 2Exascale Computing Research / Intel, CEA, GENCI, UVSQ Versailles, France http://qmcchem.ups-tlse.fr 4 Dec 2012 A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  2. logos Quantum Monte Carlo The QMC=Chem code Outline 1 Quantum

    Monte Carlo 2 The QMC=Chem code A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  3. logos Quantum Monte Carlo The QMC=Chem code Quantum Monte Carlo

    methods Solve the Schrödinger equation with random walks State-of-the-art and routine approaches in physics : nuclear physics, condensed-matter, spin systems, quantum liquids, infrared spectroscopy . . . Still of confidential use for the electronic structure problem of quantum chemistry (as opposed to post-HF and DFT) Reason : Very high computational cost for small/medium systems But : Very favorable scaling with system size compared to standard methods Ideally suited to extreme parallelism A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  4. logos Quantum Monte Carlo The QMC=Chem code Quantum Monte Carlo

    methods Solve the Schrödinger equation with random walks State-of-the-art and routine approaches in physics : nuclear physics, condensed-matter, spin systems, quantum liquids, infrared spectroscopy . . . Still of confidential use for the electronic structure problem of quantum chemistry (as opposed to post-HF and DFT) Reason : Very high computational cost for small/medium systems But : Very favorable scaling with system size compared to standard methods Ideally suited to extreme parallelism A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  5. logos Quantum Monte Carlo The QMC=Chem code Quantum Monte Carlo

    methods Solve the Schrödinger equation with random walks State-of-the-art and routine approaches in physics : nuclear physics, condensed-matter, spin systems, quantum liquids, infrared spectroscopy . . . Still of confidential use for the electronic structure problem of quantum chemistry (as opposed to post-HF and DFT) Reason : Very high computational cost for small/medium systems But : Very favorable scaling with system size compared to standard methods Ideally suited to extreme parallelism A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  6. logos Quantum Monte Carlo The QMC=Chem code Quantum Monte Carlo

    methods Solve the Schrödinger equation with random walks State-of-the-art and routine approaches in physics : nuclear physics, condensed-matter, spin systems, quantum liquids, infrared spectroscopy . . . Still of confidential use for the electronic structure problem of quantum chemistry (as opposed to post-HF and DFT) Reason : Very high computational cost for small/medium systems But : Very favorable scaling with system size compared to standard methods Ideally suited to extreme parallelism A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  7. logos Quantum Monte Carlo The QMC=Chem code Quantum Monte Carlo

    methods Solve the Schrödinger equation with random walks State-of-the-art and routine approaches in physics : nuclear physics, condensed-matter, spin systems, quantum liquids, infrared spectroscopy . . . Still of confidential use for the electronic structure problem of quantum chemistry (as opposed to post-HF and DFT) Reason : Very high computational cost for small/medium systems But : Very favorable scaling with system size compared to standard methods Ideally suited to extreme parallelism A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  8. logos Quantum Monte Carlo The QMC=Chem code Quantum Monte Carlo

    for molecular systems Problem : Solve stochastically the Schrödinger equation for N electrons in a molecule E = dr1 . . . drNΦ(r1, . . . , rN)HΦ(r1, . . . , rN) dr1 . . . drNΦ(r1, . . . , rN)Φ(r1, . . . , rN) ∼ HΨ(r1, . . . , rN) Ψ(r1, . . . , rN) , sampled with (Ψ × Φ) H : Hamiltonian operator E : Energy r1, . . . , rN : Electron coordinates Φ : Exact wave function Ψ : Trial wave function A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  9. logos Quantum Monte Carlo The QMC=Chem code QMC in a

    few words Walker = 3N-dimensional vector containing the positions of the N electrons Stochastic trajectories for walkers (or set of electrons) To impose importance sampling we need of an approximate computable trial wavefunction ΨT which helps to drive the electronic trajectories into the important regions To get chemical properties, averages are computed along electronic trajectories Extreme parallelism : Independent populations of walkers (no communications) can be distributed on different CPUs A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  10. logos Quantum Monte Carlo The QMC=Chem code QMC in a

    few words Walker = 3N-dimensional vector containing the positions of the N electrons Stochastic trajectories for walkers (or set of electrons) To impose importance sampling we need of an approximate computable trial wavefunction ΨT which helps to drive the electronic trajectories into the important regions To get chemical properties, averages are computed along electronic trajectories Extreme parallelism : Independent populations of walkers (no communications) can be distributed on different CPUs A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  11. logos Quantum Monte Carlo The QMC=Chem code QMC in a

    few words Walker = 3N-dimensional vector containing the positions of the N electrons Stochastic trajectories for walkers (or set of electrons) To impose importance sampling we need of an approximate computable trial wavefunction ΨT which helps to drive the electronic trajectories into the important regions To get chemical properties, averages are computed along electronic trajectories Extreme parallelism : Independent populations of walkers (no communications) can be distributed on different CPUs A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  12. logos Quantum Monte Carlo The QMC=Chem code QMC in a

    few words Walker = 3N-dimensional vector containing the positions of the N electrons Stochastic trajectories for walkers (or set of electrons) To impose importance sampling we need of an approximate computable trial wavefunction ΨT which helps to drive the electronic trajectories into the important regions To get chemical properties, averages are computed along electronic trajectories Extreme parallelism : Independent populations of walkers (no communications) can be distributed on different CPUs A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  13. logos Quantum Monte Carlo The QMC=Chem code QMC Algorithm A.

    Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  14. logos Quantum Monte Carlo The QMC=Chem code Amyloid β peptide

    simulation on CURIE machine (GENCI/TGCC/CEA, France) First step of our scientific project on Alzheimer disease : Energy difference of the β-strand and the α-helix conformations of Aβ(28-35) (a 1302-dimensional PDE to solve ! !) ⇒ SUSTAINED 960 TFlops/s on 76 800 cores of CURIE A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  15. logos Quantum Monte Carlo The QMC=Chem code Outline 1 Quantum

    Monte Carlo 2 The QMC=Chem code A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  16. logos Quantum Monte Carlo The QMC=Chem code Implementation in QMC=Chem

    Block : Nwalk walkers executing Nstep steps Compute as many blocks as possible, as quickly as possible Block are independent : block averages have a Gaussian distribution N step N proc N walk CPU time A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  17. logos Quantum Monte Carlo The QMC=Chem code Implementation in QMC=Chem

    Block : Nwalk walkers executing Nstep steps Compute as many blocks as possible, as quickly as possible Block are independent : block averages have a Gaussian distribution N step N proc N walk CPU time A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  18. logos Quantum Monte Carlo The QMC=Chem code Implementation in QMC=Chem

    Block : Nwalk walkers executing Nstep steps Compute as many blocks as possible, as quickly as possible Block are independent : block averages have a Gaussian distribution N step N proc N walk CPU time A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  19. logos Quantum Monte Carlo The QMC=Chem code Parallelism in QMC=Chem

    All I/O and network communications are asynchronous Very small memory footprint (10—300 MiB/core) Master compute node Data Server Slave Compute node Manager Database Main worker thread Forwarder Forwarder Worker Worker Worker Network Thread I/O Thread Worker Worker Worker A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  20. logos Quantum Monte Carlo The QMC=Chem code Fault-tolerance Extreme parallelism

    −→ possible system failures Blocks are Gaussian → losing blocks doesn’t change the average Simulation survives to removal of any node Restart always possible from data base Forwarder Data Server Forwarder Forwarder Forwarder Forwarder Forwarder Forwarder Forwarder Forwarder Forwarder Forwarder Forwarder Forwarder Forwarder Forwarder DataBase Data Server Forwarder Forwarder Forwarder Forwarder Forwarder Forwarder Forwarder Forwarder Forwarder Forwarder Forwarder Forwarder Forwarder Forwarder Forwarder A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  21. logos Quantum Monte Carlo The QMC=Chem code QMC=Chem scaling Almost

    ideal scaling −→ single-core optimization is crucial. 0 200 400 600 800 1000 0 200 400 600 800 1000 Speed-up Number of 16-core nodes Infinite 3 hours (estimated) 1 hour (estimated) 10 minutes A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  22. logos Quantum Monte Carlo The QMC=Chem code Hot-spots in a

    Monte Carlo step 1 Matrix inversion O(N3) (DP, Intel MKL) 2 Sparse×dense matrix products O(N2) (SP, our implementation) Multiply one dense matrix AN×Nbasis with 5 sparse matrices {B1, . . . , B5}Nbasis×Nelec with the same non-zero pattern to produce 5 dense matrices {C1, . . . , C5}N×Nelec . Nbasis : number of basis functions, Nelec : number of electrons Nbasis ∼ 5 × Nelec , N = 2 × Nelec A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  23. logos Quantum Monte Carlo The QMC=Chem code Hot-spots in a

    Monte Carlo step 1 Matrix inversion O(N3) (DP, Intel MKL) 2 Sparse×dense matrix products O(N2) (SP, our implementation) Multiply one dense matrix AN×Nbasis with 5 sparse matrices {B1, . . . , B5}Nbasis×Nelec with the same non-zero pattern to produce 5 dense matrices {C1, . . . , C5}N×Nelec . Nbasis : number of basis functions, Nelec : number of electrons Nbasis ∼ 5 × Nelec , N = 2 × Nelec A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  24. logos Quantum Monte Carlo The QMC=Chem code Hot-spots in a

    Monte Carlo step 1 Matrix inversion O(N3) (DP, Intel MKL) 2 Sparse×dense matrix products O(N2) (SP, our implementation) Multiply one dense matrix AN×Nbasis with 5 sparse matrices {B1, . . . , B5}Nbasis×Nelec with the same non-zero pattern to produce 5 dense matrices {C1, . . . , C5}N×Nelec . Nbasis : number of basis functions, Nelec : number of electrons Nbasis ∼ 5 × Nelec , N = 2 × Nelec A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  25. logos Quantum Monte Carlo The QMC=Chem code Hot-spots in a

    Monte Carlo step 1 Matrix inversion O(N3) (DP, Intel MKL) 2 Sparse×dense matrix products O(N2) (SP, our implementation) Multiply one dense matrix AN×Nbasis with 5 sparse matrices {B1, . . . , B5}Nbasis×Nelec with the same non-zero pattern to produce 5 dense matrices {C1, . . . , C5}N×Nelec . Nbasis : number of basis functions, Nelec : number of electrons Nbasis ∼ 5 × Nelec , N = 2 × Nelec A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  26. logos Quantum Monte Carlo The QMC=Chem code Hot-spots in a

    Monte Carlo step 1 Matrix inversion O(N3) (DP, Intel MKL) 2 Sparse×dense matrix products O(N2) (SP, our implementation) Multiply one dense matrix AN×Nbasis with 5 sparse matrices {B1, . . . , B5}Nbasis×Nelec with the same non-zero pattern to produce 5 dense matrices {C1, . . . , C5}N×Nelec . Nbasis : number of basis functions, Nelec : number of electrons Nbasis ∼ 5 × Nelec , N = 2 × Nelec A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  27. logos Quantum Monte Carlo The QMC=Chem code Sparse-dense matrix products

    Choose loop order that permits vectorization C1 = 0. ; C2 = 0. ; C3 = 0. ; C4 = 0. ; C5 = 0. do i=1, Number of electrons do k=1, Number of non-zero AOs for electron i do j=1, Number of molecular orbitals C1(j,i) += A(j,indices(k,i))*B1(k,i) C2(j,i) += A(j,indices(k,i))*B2(k,i) C3(j,i) += A(j,indices(k,i))*B3(k,i) C4(j,i) += A(j,indices(k,i))*B4(k,i) C5(j,i) += A(j,indices(k,i))*B5(k,i) end do end do end do A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  28. logos Quantum Monte Carlo The QMC=Chem code Sparse-dense matrix products

    Hand-written optimizations for Sandy Bridge architecture : All arrays are 256-bit aligned (compiler directives) LDA are multiples of 8 (Single precision, 256 bit AVX) Unroll and jam to reduce nb of stores Loop distribution to avoid register spilling Blocking/sorting to reduce cache misses in access to A A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  29. logos Quantum Monte Carlo The QMC=Chem code Sparse-dense matrix products

    Hand-written optimizations for Sandy Bridge architecture : All arrays are 256-bit aligned (compiler directives) LDA are multiples of 8 (Single precision, 256 bit AVX) Unroll and jam to reduce nb of stores Loop distribution to avoid register spilling Blocking/sorting to reduce cache misses in access to A A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  30. logos Quantum Monte Carlo The QMC=Chem code Sparse-dense matrix products

    Hand-written optimizations for Sandy Bridge architecture : All arrays are 256-bit aligned (compiler directives) LDA are multiples of 8 (Single precision, 256 bit AVX) Unroll and jam to reduce nb of stores Loop distribution to avoid register spilling Blocking/sorting to reduce cache misses in access to A A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  31. logos Quantum Monte Carlo The QMC=Chem code Sparse-dense matrix products

    Hand-written optimizations for Sandy Bridge architecture : All arrays are 256-bit aligned (compiler directives) LDA are multiples of 8 (Single precision, 256 bit AVX) Unroll and jam to reduce nb of stores Loop distribution to avoid register spilling Blocking/sorting to reduce cache misses in access to A A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  32. logos Quantum Monte Carlo The QMC=Chem code Sparse-dense matrix products

    Hand-written optimizations for Sandy Bridge architecture : All arrays are 256-bit aligned (compiler directives) LDA are multiples of 8 (Single precision, 256 bit AVX) Unroll and jam to reduce nb of stores Loop distribution to avoid register spilling Blocking/sorting to reduce cache misses in access to A A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  33. logos Quantum Monte Carlo The QMC=Chem code Sparse-dense matrix products

    do i=1, Number of electrons do k=1, Number of non-zero AOs for electron i, 4 do j=1, Number of molecular orbitals C1(j,i) += A(j,indice(k ,i))*B1(k ,i) + A(j,indice(k+1,i))*B1(k+1,i) & + A(j,indice(k+2,i))*B1(k+2,i) + A(j,indice(k+3,i))*B1(k+3,i) C2(j,i) += A(j,indice(k ,i))*B2(k ,i) + A(j,indice(k+1,i))*B2(k+1,i) & + A(j,indice(k+2,i))*B2(k+2,i) + A(j,indice(k+3,i))*B2(k+3,i) end do do j=1, Number of molecular orbitals C3(j,i) += A(j,indice(k ,i))*B3(k ,i) + A(j,indice(k+1,i))*B3(k+1,i) & + A(j,indice(k+2,i))*B3(k+2,i) + A(j,indice(k+3,i))*B3(k+3,i) C4(j,i) += A(j,indice(k ,i))*B4(k ,i) + A(j,indice(k+1,i))*B4(k+1,i) & + A(j,indice(k+2,i))*B4(k+2,i) + A(j,indice(k+3,i))*B4(k+3,i) end do do j=1, Number of molecular orbitals ! Unrolled 2x by compiler C5(j,i) += A(j,indice(k ,i))*B5(k ,i) + A(j,indice(k+1,i))*B5(k+1,i) & + A(j,indice(k+2,i))*B5(k+2,i) + A(j,indice(k+3,i))*B5(k+3,i) end do end do end do A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  34. logos Quantum Monte Carlo The QMC=Chem code Sparse-dense matrix products

    Efficiency of the matrix products : Static analysis (MAQAO) : Full-AVX (no scalar operations), inner-most loops perform 16 flops/cycle Decremental analysis (DECAN) : good balance between flops and memory operations Up to 64% of the peak measured on Xeon E5 A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  35. logos Quantum Monte Carlo The QMC=Chem code Sparse-dense matrix products

    Efficiency of the matrix products : Static analysis (MAQAO) : Full-AVX (no scalar operations), inner-most loops perform 16 flops/cycle Decremental analysis (DECAN) : good balance between flops and memory operations Up to 64% of the peak measured on Xeon E5 A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  36. logos Quantum Monte Carlo The QMC=Chem code Sparse-dense matrix products

    Efficiency of the matrix products : Static analysis (MAQAO) : Full-AVX (no scalar operations), inner-most loops perform 16 flops/cycle Decremental analysis (DECAN) : good balance between flops and memory operations Up to 64% of the peak measured on Xeon E5 A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  37. logos Quantum Monte Carlo The QMC=Chem code Hot-spots in a

    Monte Carlo step TABLE: Single core performance (GFlops/s), % peak in parenthesis, Measured on Intel Xeon E31240, 3.30GHz (52.8 GFlops/s SP, 26.4 GFlops/s DP). Nelec Nbasis Matrix inversion Matrix products Overall (1 core) 158 404 6.3 (24%) 26.6 (50%) 8.8 (23%) 434 963 14.0 (53%) 33.1 (63%) 11.8 (33%) 434 2934 14.0 (53%) 33.6 (64%) 13.7 (38%) 1056 2370 17.9 (67%) 30.6 (58%) 15.2 (49%) 1731 3892 17.8 (67%) 28.2 (53%) 16.2 (55%) A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry
  38. logos Quantum Monte Carlo The QMC=Chem code Acknowkledgments : GENCI

    CALMIP CEA PRACE Intel Bull ANR A. Scemama, M. Caffarel, E. Oseret, W. Jalby Petascale QMC for chemistry