

Accelerating Empirical Dynamic Modeling using High Performance Computing

This talk will present our efforts in accelerating and scaling Empirical Dynamic Modeling (EDM), a nonlinear nonparametric time series analysis framework, using the latest high performance computing hardware and software. I will discuss our massively parallel implementation of EDM named mpEDM that leverages multi-threading and GPU offloading. mpEDM scales up to 511 compute nodes (2,044 NVIDIA V100 GPUs) on the ABCI supercomputer and achieves up to 1,530x speedup in large-scale causal inference computation compared to a previous EDM library. I will also present our recent work on a fully GPU offloaded implementation of EDM based on the Kokkos performance portability framework.

Keichi Takahashi

August 20, 2021

Transcript

  1. Keichi Takahashi [email protected] Software Design and Analysis Lab, Nara Institute of Science and Technology
     Accelerating Empirical Dynamic Modeling Using High Performance Computing
     University of Melbourne QMNET Talk, 8/20/2021
  2. Agenda
     • Empirical Dynamic Modeling
     • mpEDM: Massively Parallel Empirical Dynamic Modeling
     • kEDM: Fully GPU-offloaded Empirical Dynamic Modeling
  3. Empirical Dynamic Modeling (EDM)
     • A non-parametric method for modeling non-linear time series data.
     • Reconstructs the underlying attractor manifolds from time series data.
     • EDM methods:
       • Simplex projection (short-term forecasting)
       • S-Map (quantifying non-linearity)
       • Convergent Cross Mapping (identifying causal relationships)
       • …
     C.W. Chang et al., "Empirical Dynamic Modeling for Beginners", Ecological Research, vol. 32, pp. 785-796, 2017.
  4. State Space Reconstruction (SSR)
     Figure: time series observations are embedded using time lags to obtain a reconstructed manifold; the reconstruction is a diffeomorphism of the shadow manifold, so topological features are preserved.
     Ethan R. Deyle, George Sugihara, "Generalized Theorems for Nonlinear State Space Reconstruction", PLoS ONE, vol. 6, no. 3, e18295, 2011.
  5. Simplex Projection (Short-term Forecast)
     1. Create the time-delayed embedding of X(t): x(t) = (X(t), X(t-\tau), \ldots, X(t-(E-1)\tau))
     2. Find the k-nearest neighbors of x(t) in the state space: x(t_1), x(t_2), \ldots, x(t_{E+1})
     3. Compute the exponentially weighted average of the neighbors' values T_p steps forward:
        \hat{x}(t+T_p) = \sum_{i=1}^{E+1} \frac{w_i}{\sum_{j=1}^{E+1} w_j} \cdot x(t_i+T_p), \qquad w_i = \exp\left\{ -\frac{\lVert x(t)-x(t_i)\rVert}{\min_j \lVert x(t)-x(t_j)\rVert} \right\}
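     To make the three steps concrete, here is a minimal brute-force sketch in C++ (my own illustration, not code from mpEDM or kEDM; the function name simplex_predict and its scalar interface are assumptions):

     #include <algorithm>
     #include <cmath>
     #include <vector>

     // Predict X(t + Tp) using the E+1 nearest neighbors of the delay vector x(t).
     // Assumes the library contains at least E+1 points with a known Tp-step future.
     double simplex_predict(const std::vector<double>& X, int t, int E, int tau, int Tp) {
         // Euclidean distance between the delay vectors x(a) and x(b).
         auto dist = [&](int a, int b) {
             double d = 0.0;
             for (int k = 0; k < E; k++) {
                 double diff = X[a - k * tau] - X[b - k * tau];
                 d += diff * diff;
             }
             return std::sqrt(d);
         };

         // Library points: indices whose delay vector and Tp-step future both exist.
         std::vector<int> lib;
         for (int i = (E - 1) * tau; i + Tp < (int)X.size(); i++)
             if (i != t) lib.push_back(i);

         // Brute-force k-NN: move the E+1 closest library points to the front.
         std::partial_sort(lib.begin(), lib.begin() + E + 1, lib.end(),
                           [&](int a, int b) { return dist(t, a) < dist(t, b); });

         // Exponentially weighted average of the neighbors' futures.
         double dmin = std::max(dist(t, lib[0]), 1e-12);
         double num = 0.0, den = 0.0;
         for (int i = 0; i <= E; i++) {
             double w = std::exp(-dist(t, lib[i]) / dmin);
             num += w * X[lib[i] + Tp];
             den += w;
         }
         return num / den;
     }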
  6. Convergent Cross Mapping (Causal Inference)
     1. Create the time-delayed embeddings of both X(t) and Y(t).
     2. Find the k-nearest neighbors of x(t) in the state space: x(t_1), x(t_2), \ldots, x(t_{E+1})
     3. Compute the exponentially weighted average of the corresponding points of Y:
        \hat{y}(t+T_p) = \sum_{i=1}^{E+1} \frac{w_i}{\sum_{j=1}^{E+1} w_j} \cdot y(t_i+T_p), \qquad w_i = \exp\left\{ -\frac{\lVert x(t)-x(t_i)\rVert}{\min_j \lVert x(t)-x(t_j)\rVert} \right\}
     4. If X(t) predicts Y(t) with high accuracy, Y(t) "CCM-causes" X(t).
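     Step 4 is commonly quantified as the Pearson correlation between the cross-mapped estimates and the observations. A minimal sketch, assuming the estimates have already been computed as above (the name cross_map_skill is mine, not kEDM's API):

     #include <cmath>
     #include <numeric>
     #include <vector>

     // Cross-map skill: Pearson correlation between cross-mapped estimates yhat
     // and observed values y (assumes both vectors have the same length).
     double cross_map_skill(const std::vector<double>& yhat, const std::vector<double>& y) {
         const double n = (double)y.size();
         const double mx = std::accumulate(yhat.begin(), yhat.end(), 0.0) / n;
         const double my = std::accumulate(y.begin(), y.end(), 0.0) / n;
         double sxy = 0.0, sxx = 0.0, syy = 0.0;
         for (size_t i = 0; i < y.size(); i++) {
             sxy += (yhat[i] - mx) * (y[i] - my);
             sxx += (yhat[i] - mx) * (yhat[i] - mx);
             syy += (y[i] - my) * (y[i] - my);
         }
         return sxy / std::sqrt(sxx * syy);   // rho close to 1 => strong cross mapping
     }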
  7. Our Use Case
     • We employ CCM to investigate the causal relationships among all neurons in a larval zebrafish brain.
     • This requires pairwise CCM between 100K time series, each with 10K time steps: 10 billion cross mappings!
     Figure: larval zebrafish → light sheet microscopy → neuronal activity → causal relationships.
  8. Sugihara Lab EDM implementations
     • Language bindings: rEDM, pyEDM
     • Core library: cppEDM
     • Dependencies: C++ STL, LAPACK
  9. Prototype using cppEDM
     • cppEDM on 510 ABCI compute nodes took 8.5 hours to analyze 53K time series, each with 1.5K time steps.
     • Assuming ideal scaling, our target dataset (100K time series w/ 10K time steps) would take ~60 days to analyze on 510 compute nodes.
     Joseph Park, Gerald Pao, Cameron Smith and George Sugihara, "Massively Parallel Empirical Dynamic Cross Mapping", 37th Pacific Rim Applications and Grid Middleware Assembly Workshop (PRAGMA 37), Sep. 2019.
  10. Observations from profiling cppEDM
     • k-nearest neighbor search is the primary bottleneck, so optimization effort should focus there.
     • Memory allocation also takes significant time, due to dynamically allocating buffers during computation.
     • Large load imbalance between compute nodes, due to static decomposition of work among nodes.
     Figure: runtime breakdown [%] across k-NN search, memory allocation, and others.
     Wassapon Watanakeesuntorn, Kohei Ichikawa, Keichi Takahashi, Jason Haga, Gerald Pao, "Calculation of Complete Zebrafish Brain Neural Activities on ABCI", 36th Pacific Rim Applications and Grid Middleware Assembly Workshop (PRAGMA 36), Apr. 2019.
  11. Key Idea: Pre-computed Lookup Tables of Neighbors
     \hat{y}_m(t) = \sum_{i=1}^{E+1} \frac{w_i}{\sum_{j=1}^{E+1} w_j} \cdot y_m(t_i) \qquad (m = 1, 2, 3, \ldots)
     When computing one-to-all CCM, the distances and indices of the k-nearest neighbors (and hence the weights w_i and neighbor times t_i) only need to be computed once; they can then be reused for every target time series y_m.
  12. Improved Pairwise CCM Algorithm
     For every time series (effect):
     1. Embed it into the state space using time lags.
     2. Run all k-NN searches in the state space, producing lookup tables of neighbor indices and weights.
     3. Batch predict every other time series (cause) using the pre-computed tables.
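     A minimal single-threaded sketch of this loop structure (my own illustration, not the mpEDM implementation; names such as build_knn_table and batch_predict are assumptions) shows why the k-NN work is amortized: the table of each effect is built once and reused for every cause.

     #include <algorithm>
     #include <cmath>
     #include <vector>

     struct KnnTable {                         // one row per predicted time step
         std::vector<std::vector<int>>   idx;  // neighbor time indices
         std::vector<std::vector<float>> w;    // normalized weights
     };

     // Steps 1-2: delay-embed the effect and find the E+1 nearest neighbors of
     // every point, storing their indices and normalized exponential weights.
     KnnTable build_knn_table(const std::vector<float>& eff, int E, int tau) {
         const int start = (E - 1) * tau, n = (int)eff.size();
         KnnTable tbl;
         for (int t = start; t < n; t++) {
             std::vector<std::pair<float, int>> cand;
             for (int i = start; i < n; i++) {
                 if (i == t) continue;
                 float d = 0.f;
                 for (int k = 0; k < E; k++) {
                     float diff = eff[t - k * tau] - eff[i - k * tau];
                     d += diff * diff;
                 }
                 cand.push_back({std::sqrt(d), i});
             }
             std::partial_sort(cand.begin(), cand.begin() + E + 1, cand.end());
             float dmin = std::max(cand[0].first, 1e-12f), sum = 0.f;
             std::vector<int> idx(E + 1);
             std::vector<float> w(E + 1);
             for (int k = 0; k <= E; k++) {
                 idx[k] = cand[k].second;
                 w[k] = std::exp(-cand[k].first / dmin);
                 sum += w[k];
             }
             for (float& x : w) x /= sum;      // normalize so weights sum to 1
             tbl.idx.push_back(idx);
             tbl.w.push_back(w);
         }
         return tbl;
     }

     // Step 3: predict a cause time series using the pre-computed table.
     std::vector<float> batch_predict(const KnnTable& tbl, const std::vector<float>& cause) {
         std::vector<float> pred(tbl.idx.size());
         for (size_t j = 0; j < tbl.idx.size(); j++) {
             float y = 0.f;
             for (size_t k = 0; k < tbl.idx[j].size(); k++)
                 y += tbl.w[j][k] * cause[tbl.idx[j][k]];
             pred[j] = y;
         }
         return pred;
     }

     // Usage: KnnTable tbl = build_knn_table(effect, E, tau);   // once per effect
     //        auto pred = batch_predict(tbl, cause);            // for every cause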
  13. k-Nearest Neighbor Search Algorithms
     • Exact methods: full search (Faiss)
     • Approximate methods:
       • Proximity graphs (NMSLIB, NGT)
       • Locality Sensitive Hashing (FALCONN)
       • Space partitioning (Annoy, FLANN, ScaNN)
       • Product quantization (Faiss, ScaNN)
     Yusuke Matsui, "Billion-scale Approximate Nearest Neighbor Search", CVPR 2020 Tutorial.
  14. Choosing the k-NN search algorithm
     • Data dimension (i.e. embedding dimension) < 20: relatively low dimension.
     • Number of points (i.e. length of time series) < 10,000: the cost of constructing index structures doesn't pay off.
     • The distance matrix fits in CPU/GPU memory (4 * 10,000^2 = 400 MB).
     • We need an exact solution: it is unknown how approximation affects the CCM result.
     We therefore take the brute-force search approach.
  15. CUDA Ecosystem
     • Programming models: CUDA C/C++/Fortran, OpenMP, OpenACC, SYCL, HIP, Kokkos
     • Math libraries and parallel primitives: Thrust, CUB, cuBLAS, cuDNN, cuFFT, cuTENSOR, cuSPARSE, cuSOLVER
     • High-level libraries: ArrayFire, PyTorch, TensorFlow, CuPy
  16. Hierarchical Parallelism
     • Whole cluster: 537 PFLOPS (432 racks/cluster)
     • Rack: 1.3 PFLOPS (384 nodes/rack)
     • Node: 3.38 TFLOPS (48 cores/CPU)
     • Core: 70 GFLOPS
  17. Inter-node Design
     Figure: a master node coordinates many workers, which asynchronously fetch and execute work (small reads/writes); storage consists of the parallel file system (Lustre) and an on-demand burst buffer (BeeOND), which handle the bulk reads/writes.
  18. Intra-node Design
     Figure: within a node, each time series goes through Step 1 (embedding), Step 2 (all k-NN search on the GPU, producing lookup tables of neighbor indices and weights) and Step 3 (batch prediction using the tables on the CPU).
  19. ArrayFire
     • A general-purpose library for GPU computing.
     • Provides a numpy-like high-level interface.
     • Backend invokes vendor-optimized libraries (cuBLAS, cuSPARSE, cuSOLVER, cuDNN, etc.).
     https://arrayfire.com/
  20. mpEDM
     • Massively parallel EDM designed for GPU-centric supercomputers (i.e. ABCI).
     • Currently supports Simplex projection and CCM.
     • GitHub: https://github.com/keichi/mpEDM
     Wassapon Watanakeesuntorn, Keichi Takahashi, Kohei Ichikawa, Joseph Park, George Sugihara, Ryousei Takano, Jason Haga, Gerald M. Pao, "Massively Parallel Causal Inference of Whole Brain Dynamics at Single Neuron Resolution", 26th International Conference on Parallel and Distributed Systems (ICPADS 2020), Dec. 2020.
  21. Evaluation: CCM Runtime
     Dataset     | # of time series | # of time steps | cppEDM (512 nodes) | mpEDM (1 node) | mpEDM (512 nodes)
     Fish1_Normo | 53,053           | 1,450           | 8.5 h              | 1,973 s        | 20 s
     Subject6    | 92,538           | 3,780           | 176 h*             | 13,953 s       | 101 s
     Subject11   | 101,729          | 8,528           | 1,081 h*           | 39,572 s       | 199 s
     mpEDM is up to 1,530x faster and 7,941x cheaper (8,704 USD → 1.1 USD).
     * Extrapolated from the Fish1_Normo runtime, not measured. ABCI usage fee is 220 JPY (2 USD) per node-hour: https://abci.ai/en/how_to_use/tariffs.html
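     For reference, the headline numbers follow directly from the table and the quoted rate of roughly 2 USD per node-hour:
     Speedup: 8.5 h / 20 s = 30,600 s / 20 s = 1,530x.
     cppEDM cost: 512 nodes × 8.5 h × 2 USD/node-hour = 8,704 USD.
     mpEDM (1 node) cost: 1,973 s ≈ 0.55 node-hours × 2 USD/node-hour ≈ 1.1 USD, giving the quoted ~7,941x cost reduction.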
  22. Evaluation: Inter-node scalability
     Figures: strong scaling performance from a single node up to 511 workers, shown as absolute runtime [min] and relative speedup, for both the CPU and GPU implementations.
  23. Evaluation: Intra-node scalability
     Figures: runtime (Simplex projection, CCM, others) and runtime breakdown (k-NN vs. lookup, % of total) for 1K-100K time series with 10K time steps, and for 1K time series with 1K-40K time steps.
  24. Challenges remaining in mpEDM
     • Using ArrayFire's k-NN search misses optimization opportunities:
       • For example, k-NN search could be fused with embedding to improve efficiency.
       • k is usually small (<20) in EDM, but ArrayFire's k-NN is tuned for the general case.
     • Batch prediction (lookup) is executed on the CPU:
       • We tried to implement it on the GPU using ArrayFire (GFOR, sparse arrays, etc.) but performance was poor.
     • Performance portability:
       • A CPU version had to be maintained separately for CPU-only platforms.
  25. NVIDIA GPU Hardware Architecture
     Figure: a GPU consists of many streaming multiprocessors (SMs), each with its own scheduler, registers, L1 cache and CUDA cores, sharing an L2 cache and global (HBM) memory. Approximate bandwidths: L1 cache 14 TB/s, L2 cache 3 TB/s, HBM 800 GB/s.
  26. CUDA Programming: Grids of Blocks of Threads
     GPU code ("kernel"):
       __global__ void saxpy(int n, float a, float *x, float *y) {
           int i = blockIdx.x * blockDim.x + threadIdx.x;
           if (i < n) y[i] = a * x[i] + y[i];
       }
     CPU code (launch):
       saxpy<<<4096, 256>>>(N, 2.0, d_x, d_y);
     Figure: a launched grid consists of blocks, each consisting of threads.
  27. Hardware to Software Mapping
     • A CUDA core executes threads.
     • An SM executes thread blocks.
     • The GPU executes grids.
  28. k-Nearest Neighbor Search (Distance Calculation)
     Each thread block computes one row i of the distance matrix D, where D_{ij} = \sum_{k=0}^{E-1} \bigl(X(i-k\tau) - X(j-k\tau)\bigr)^2, and the delay vector of point i, X(i-k\tau), is cached in shared memory. Each thread in the block computes one element D_{ij}; the time-delayed embedding is performed on the fly.
  29. k-Nearest Neighbor Search (Top-k Search)
     Each thread block searches for the top-k elements in row i of the distance matrix D. Each thread scans a segment of the row and maintains the top-k elements it has seen; the leader thread then merges the per-thread top-k candidates.
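     The same two-phase idea written on the CPU for clarity (my own illustration, not the GPU kernel): each "segment scanner" keeps a local top-k in a max-heap, and a final merge keeps the global top-k.

     #include <algorithm>
     #include <queue>
     #include <vector>

     // Return the column indices of the k smallest distances in one row of D.
     std::vector<int> topk_row(const std::vector<float>& row, int k, int num_segments) {
         using Cand = std::pair<float, int>;                   // (distance, column)
         std::vector<Cand> candidates;

         // Phase 1: each segment keeps its local top-k (smallest distances).
         const int seg_len = ((int)row.size() + num_segments - 1) / num_segments;
         for (int s = 0; s < num_segments; s++) {
             std::priority_queue<Cand> local;                  // max-heap of size <= k
             const int begin = s * seg_len;
             const int end = std::min(begin + seg_len, (int)row.size());
             for (int j = begin; j < end; j++) {
                 local.push({row[j], j});
                 if ((int)local.size() > k) local.pop();       // drop the current worst
             }
             for (; !local.empty(); local.pop())               // collect local top-k
                 candidates.push_back(local.top());
         }

         // Phase 2: the "leader" merges all local candidates and keeps the global top-k.
         const int m = std::min(k, (int)candidates.size());
         std::partial_sort(candidates.begin(), candidates.begin() + m, candidates.end());
         std::vector<int> idx;
         for (int j = 0; j < m; j++)
             idx.push_back(candidates[j].second);
         return idx;
     }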
  30. Batch prediction
     Each thread block makes predictions for one target time series Y_i (Y_i is cached in shared memory); each thread predicts one time step j of that series, using the pre-computed lookup tables of neighbor indices t(j, k) and weights w(j, k):
     \hat{Y}_i(j) = \sum_{k=1}^{E+1} w(j, k) \cdot Y_i(t(j, k))
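     A minimal CUDA sketch of this mapping (my own illustration, not the kEDM kernel; it omits the shared-memory caching of Y_i mentioned above, and all names and layouts are assumptions):

     // Block b predicts target time series b; thread j predicts time step j,
     // reading the pre-computed neighbor indices and normalized weights.
     __global__ void batch_lookup(const float* __restrict__ Y,      // targets, row-major [n_series][n_steps]
                                  const int*   __restrict__ nn_idx, // [n_pred][k] neighbor time indices
                                  const float* __restrict__ nn_w,   // [n_pred][k] normalized weights
                                  float* __restrict__ Yhat,         // [n_series][n_pred] predictions
                                  int n_steps, int n_pred, int k) {
         const int i = blockIdx.x;                        // target time series
         const float* Yi = Y + (size_t)i * n_steps;

         for (int j = threadIdx.x; j < n_pred; j += blockDim.x) {
             float y = 0.0f;
             for (int l = 0; l < k; l++)
                 y += nn_w[j * k + l] * Yi[nn_idx[j * k + l]];
             Yhat[(size_t)i * n_pred + j] = y;
         }
     }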
  31. Kokkos
     • A performance portability framework: the same source code runs on many different architectures.
     • Supports OpenMP, CUDA, HIP, SYCL and HPX as backends.
     • Implemented as a C++ template library (no special compiler required).
     • Rich ecosystem:
       • kokkos-tools: profilers, tracers, debuggers, and adapters to vendor tools.
       • kokkos-kernels: dense and sparse linear algebra kernels.
     • Established in 2012; many production applications.
  32. Kokkos Core Concepts
     • Data parallel patterns
       • Parallel pattern: for, reduce, scan
       • Execution policy: scheduling, thread groups
       • Body: functor/lambda
     • Views
       • Multi-dimensional arrays
       • Automatically deallocated (reference counted)
       • Layout is architecture-dependent: a View<float**> is row-major on the CPU and column-major on the GPU.
     Example (pattern + execution policy + body):
       parallel_for(RangePolicy<ExecSpace>(N), KOKKOS_LAMBDA(int i) { y(i) = a * x(i) + y(i); });
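     A self-contained sketch that expands the snippet above into a complete Kokkos program (my own illustration; it assumes Kokkos is installed and the code is built with the desired backend):

     #include <cstdio>
     #include <Kokkos_Core.hpp>

     int main(int argc, char* argv[]) {
         Kokkos::initialize(argc, argv);
         {
             const int N = 1 << 20;
             const float a = 2.0f;

             // Views: arrays allocated in the default memory space
             // (host memory for OpenMP, device memory for CUDA/HIP).
             Kokkos::View<float*> x("x", N), y("y", N);

             // Pattern (parallel_for) + execution policy (RangePolicy) + body (lambda).
             Kokkos::parallel_for("init", Kokkos::RangePolicy<>(0, N),
                                  KOKKOS_LAMBDA(const int i) { x(i) = 1.0f; y(i) = 2.0f; });
             Kokkos::parallel_for("saxpy", Kokkos::RangePolicy<>(0, N),
                                  KOKKOS_LAMBDA(const int i) { y(i) = a * x(i) + y(i); });

             // Parallel reduction: sum of y, to check the result on the host.
             float sum = 0.0f;
             Kokkos::parallel_reduce("sum", Kokkos::RangePolicy<>(0, N),
                                     KOKKOS_LAMBDA(const int i, float& acc) { acc += y(i); },
                                     sum);
             printf("sum = %f (expected %f)\n", sum, 4.0f * N);
         }
         Kokkos::finalize();
         return 0;
     }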
  33. kEDM
     • EDM based on Kokkos.
     • Currently supports Simplex projection, CCM, and S-Map.
     • Documentation: https://kedm.readthedocs.io/en/latest/
     • The CPU version can be installed from PyPI:
       $ pip install kedm
     • The GPU version currently needs to be installed from source.
     Keichi Takahashi, Wassapon Watanakeesuntorn, Kohei Ichikawa, Joseph Park, Ryousei Takano, Jason Haga, George Sugihara, Gerald M. Pao, "kEDM: A Performance-portable Implementation of Empirical Dynamical Modeling," Practice & Experience in Advanced Research Computing (PEARC 2021), Jul. 2021.
  34. Evaluation Environment
     • Ika@Salk: Intel Xeon Gold 6148 (two sockets), NVIDIA V100, and 384 GiB RAM
     • Aori@Salk: AMD EPYC 7742 (two sockets) and 1 TiB RAM
  35. Evaluation: CCM Runtime for Real-world Datasets
     Dataset          | # of time series | # of time steps | V100 + Xeon Gold 6148 (kEDM / mpEDM / speedup) | EPYC 7742 (kEDM / mpEDM / speedup)
     Fish1_Normo_Trim | 154              | 1,600           | 3 s / 11 s / 3.67x                             | 3 s / 4 s / 1.33x
     Fly80XY          | 82               | 10,608          | 11 s / 50 s / 4.55x                            | 22 s / 30 s / 1.36x
     Genes_MEF        | 45,318           | 96              | 344 s / 334 s / 0.97x                          | 96 s / 139 s / 1.45x
     Subject6         | 92,538           | 3,780           | 5,391 s / 29,579 s / 5.49x                     | 12,145 s / 11,571 s / 0.95x
     Subject11        | 101,729          | 8,528           | 20,517 s / 85,217 s / 4.15x                    | 43,812 s / 38,542 s / 0.88x
     F1               | 8,520            | 29,484          | 11,372 s / 20,128 s / 1.77x                    | 23,001 s / 19,950 s / 0.87x
  36. Evaluation: Micro Benchmark for k-Nearest Neighbor Search
     • Measured using an artificial time series with 10K time steps.
     • EPYC: kEDM slightly (~5%) outperforms mpEDM.
     • V100: kEDM is 6.4x faster than mpEDM when E=1, but kEDM's sorting performance rapidly degrades as E increases.
  37. Evaluation: Micro Benchmark for Lookups
     • Measured using 100K artificial time series, each having 10K time steps.
     • EPYC: kEDM is slightly faster than mpEDM.
     • V100: kEDM outperforms EPYC by a factor of 3 to 4.
  38. Evaluation: Roofline Analysis of the Pairwise Distance Kernel
     • Larger E increases the arithmetic intensity (more data reuse).
     • EPYC: initially bounded by the L3 cache; then hits the L1 and L2 rooflines.
     • V100: initially bounded by HBM; larger E saturates the load/store units.
     • V100: can't use vectorized loads since the memory alignment depends on the user-supplied parameters E and tau.
  39. Evaluation: Roofline Analysis of the Lookup Kernel
     • Again, larger E increases the arithmetic intensity.
     • Bounded by L1 and L2 cache bandwidth on EPYC and V100, respectively.
     • The k-NN lookup table fits in EPYC's L1 cache (4 MiB) and V100's L2 cache (6 MiB).
  40. References
     1. Keichi Takahashi et al., "kEDM: A Performance-portable Implementation of Empirical Dynamical Modeling," Practice & Experience in Advanced Research Computing (PEARC 2021), Jul. 2021.
     2. Wassapon Watanakeesuntorn et al., "Massively Parallel Causal Inference of Whole Brain Dynamics at Single Neuron Resolution", 26th International Conference on Parallel and Distributed Systems (ICPADS 2020), Dec. 2020.
     3. Joseph Park, Gerald Pao, Cameron Smith and George Sugihara, "Massively Parallel Empirical Dynamic Cross Mapping", 37th Pacific Rim Applications and Grid Middleware Assembly Workshop (PRAGMA 37), Sep. 2019.
     Acknowledgments: Gerald Pao, Wassapon Watanakeesuntorn, Kohei Ichikawa, Joseph Park, Ryousei Takano, Jason Haga, George Sugihara.
     Software:
     • kEDM GitHub: https://github.com/keichi/kEDM
     • kEDM Documentation: https://kedm.readthedocs.io/en/latest/
     • mpEDM GitHub: https://github.com/keichi/mpEDM