
Accelerating Empirical Dynamic Modeling using High Performance Computing

This talk will present our efforts in accelerating and scaling Empirical Dynamic Modeling (EDM), a nonlinear nonparametric time series analysis framework, using the latest high performance computing hardware and software. I will discuss our massively parallel implementation of EDM named mpEDM that leverages multi-threading and GPU offloading. mpEDM scales up to 511 compute nodes (2,044 NVIDIA V100 GPUs) on the ABCI supercomputer and achieves up to 1,530x speedup in large-scale causal inference computation compared to a previous EDM library. I will also present our recent work on a fully GPU offloaded implementation of EDM based on the Kokkos performance portability framework.

Keichi Takahashi

August 20, 2021

Transcript

  1. Keichi Takahashi ([email protected]), Software Design and Analysis Lab, Nara Institute of Science and Technology. Accelerating Empirical Dynamic Modeling Using High Performance Computing. University of Melbourne QMNET Talk, 8/20/2021.
  2. Agenda • Empirical Dynamic Modeling • mpEDM: Massively Parallel Empirical Dynamic Modeling • kEDM: Fully GPU-offloaded Empirical Dynamic Modeling
  3. Empirical Dynamic Modeling (EDM) • A non-parametric method for modeling non-linear time series data. • Reconstructs the underlying attractor manifolds from time series data. • EDM methods: Simplex projection (short-term forecasting), S-Map (quantifying non-linearity), Convergent Cross Mapping (identifying causal relationships), and more. C.W. Chang et al., “Empirical Dynamic Modeling for Beginners”, Ecological Research, vol. 32, pp. 785-796, 2017.
  4. State Space Reconstruction (SSR) • Time-lagged observations of a single variable are used to build a reconstructed (shadow) manifold. • The reconstruction is a diffeomorphism of the original manifold, so topological features are preserved. Ethan R. Deyle, George Sugihara, “Generalized Theorems for Nonlinear State Space Reconstruction”, PLoS ONE, vol. 6, no. 3, e18295, 2011.
  5. Simplex Projection (Short-term Forecast)
1. Create a time-delayed embedding of X(t): x(t) = (X(t), X(t - \tau), \ldots, X(t - (E - 1)\tau))
2. Find the k nearest neighbors of x(t) in the state space: x(t_1), x(t_2), \ldots, x(t_{E+1})
3. Compute the exponentially weighted average of the neighbors, T_p steps forward:
   \hat{x}(t + T_p) = \sum_{i=1}^{E+1} \frac{w_i}{\sum_{j=1}^{E+1} w_j} \, x(t_i + T_p), \qquad w_i = \exp\left( - \frac{\lVert x(t) - x(t_i) \rVert}{\min_i \lVert x(t) - x(t_i) \rVert} \right)
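
As a concrete reference for these three steps, here is a minimal, unoptimized C++ sketch of simplex projection (my own illustration, not mpEDM or kEDM code). It assumes the library contains at least E + 1 valid delay vectors and uses squared Euclidean distances internally before taking square roots for the weights.

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Squared Euclidean distance between delay vectors anchored at times i and j.
    static double dist2(const std::vector<double>& X, int i, int j, int E, int tau) {
        double d = 0.0;
        for (int k = 0; k < E; k++) {
            double diff = X[i - k * tau] - X[j - k * tau];
            d += diff * diff;
        }
        return d;
    }

    // Predict X(t + Tp) from the E+1 nearest neighbors of x(t) among earlier delay vectors.
    double simplex_predict(const std::vector<double>& X, int t, int E, int tau, int Tp) {
        std::vector<int> lib;                       // candidate neighbor times
        for (int s = (E - 1) * tau; s + Tp < t; s++) lib.push_back(s);

        // Bring the E+1 nearest neighbors of x(t) to the front of the library.
        std::partial_sort(lib.begin(), lib.begin() + E + 1, lib.end(),
                          [&](int a, int b) { return dist2(X, t, a, E, tau) <
                                                     dist2(X, t, b, E, tau); });

        double dmin = std::sqrt(dist2(X, t, lib[0], E, tau));
        if (dmin == 0.0) dmin = 1e-12;              // guard against an exact match

        // Exponentially weighted average of the neighbors' futures.
        double num = 0.0, den = 0.0;
        for (int i = 0; i < E + 1; i++) {
            double w = std::exp(-std::sqrt(dist2(X, t, lib[i], E, tau)) / dmin);
            num += w * X[lib[i] + Tp];
            den += w;
        }
        return num / den;
    }
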
  6. Convergent Cross Mapping (Causal Inference)
1. Create time-delayed embeddings of both X(t) and Y(t).
2. Find the k nearest neighbors of x(t) in the state space: x(t_1), x(t_2), \ldots, x(t_{E+1})
3. Compute the exponentially weighted average of the corresponding points of Y:
   \hat{y}(t + T_p) = \sum_{i=1}^{E+1} \frac{w_i}{\sum_{j=1}^{E+1} w_j} \, y(t_i + T_p), \qquad w_i = \exp\left( - \frac{\lVert x(t) - x(t_i) \rVert}{\min_i \lVert x(t) - x(t_i) \rVert} \right)
4. If X(t) predicts Y(t) with high accuracy, Y(t) “CCM-causes” X(t).
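
The prediction skill in step 4 is typically summarized as the Pearson correlation between the cross-mapped estimates and the observations, and CCM checks that this correlation converges as the library grows. A small, self-contained C++ helper for that skill measure (an illustration, not code from the libraries discussed here):

    #include <cmath>
    #include <vector>

    // Cross-map skill: Pearson correlation between predictions and observations.
    double cross_map_skill(const std::vector<double>& yhat, const std::vector<double>& y) {
        const int n = static_cast<int>(yhat.size());
        double m_hat = 0.0, m = 0.0;
        for (int i = 0; i < n; i++) { m_hat += yhat[i]; m += y[i]; }
        m_hat /= n; m /= n;

        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < n; i++) {
            sxy += (yhat[i] - m_hat) * (y[i] - m);
            sxx += (yhat[i] - m_hat) * (yhat[i] - m_hat);
            syy += (y[i] - m) * (y[i] - m);
        }
        return sxy / std::sqrt(sxx * syy);
    }
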
  7. Our Use Case • We employ CCM to investigate the causal relationships between every pair of neurons in a larval zebrafish brain. • This requires pairwise CCM between 100K time series, each with 10K time steps: 10 billion cross mappings! (Figure: larval zebrafish, light sheet microscopy, neuronal activity, causal relationships.)
  8. Sugihara Lab EDM implementations • Language bindings: rEDM, pyEDM • Core library: cppEDM • Dependencies: C++ STL, LAPACK
  9. Prototype using cppEDM • cppEDM on 510 ABCI compute nodes took 8.5 hours to analyze 53K time series, each with 1.5K time steps. • Assuming ideal scaling, our target dataset (100K time series with 10K time steps) would take ~60 days to analyze on 510 compute nodes. Joseph Park, Gerald Pao, Cameron Smith and George Sugihara, “Massively Parallel Empirical Dynamic Cross Mapping”, 37th Pacific Rim Applications and Grid Middleware Assembly Workshop (PRAGMA 37), Sep. 2019.
  10. Observations from profiling cppEDM • k-nearest neighbor search is the primary bottleneck; optimization effort should focus there. • Memory allocation also takes significant time, due to dynamically allocating buffers during computation. • There is large load imbalance between compute nodes, due to static decomposition of work among nodes. (Chart: runtime breakdown [%] into k-NN search, memory allocation, and others.) Wassapon Watanakeesuntorn, Kohei Ichikawa, Keichi Takahashi, Jason Haga, Gerald Pao, “Calculation of Complete Zebrafish Brain Neural Activities on ABCI”, 36th Pacific Rim Applications and Grid Middleware Assembly Workshop (PRAGMA 36), Apr. 2019.
  11. Key Idea: Pre-computed Lookup Tables of Neighbors
When computing one-to-all CCM, the distances and indices of the k nearest neighbors only need to be computed once; the same weights w_i and neighbor times t_i are reused for every target series:
   \hat{y}_1(t) = \sum_{i=1}^{E+1} \frac{w_i}{\sum_{j=1}^{E+1} w_j} \, y_1(t_i), \quad \hat{y}_2(t) = \sum_{i=1}^{E+1} \frac{w_i}{\sum_{j=1}^{E+1} w_j} \, y_2(t_i), \quad \hat{y}_3(t) = \sum_{i=1}^{E+1} \frac{w_i}{\sum_{j=1}^{E+1} w_j} \, y_3(t_i), \quad \ldots
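
A sketch of the data structure this idea implies (illustrative only; mpEDM's actual layout is a flat, GPU-friendly array rather than nested vectors). The neighbor indices and normalized weights derived from the effect series are computed once and then applied to every candidate cause series:

    #include <cstddef>
    #include <vector>

    // Precomputed neighbor table of the effect series: for each target time t,
    // the time indices of its E+1 nearest neighbors and their normalized weights.
    struct NeighborTable {
        std::vector<std::vector<int>>    idx;  // idx[t][k]: time index of the k-th neighbor of x(t)
        std::vector<std::vector<double>> w;    // w[t][k]:  normalized weight of that neighbor
    };

    // Apply one precomputed table to any target time series Y.
    std::vector<double> cross_map(const NeighborTable& nn, const std::vector<double>& Y) {
        std::vector<double> yhat(nn.idx.size(), 0.0);
        for (std::size_t t = 0; t < nn.idx.size(); t++)
            for (std::size_t k = 0; k < nn.idx[t].size(); k++)
                yhat[t] += nn.w[t][k] * Y[nn.idx[t][k]];
        return yhat;
    }

Each prediction is then just E + 1 multiply-adds per time step, so the per-pair cost drops from a k-NN search to a table lookup.
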
  12. Improved Pairwise CCM Algorithm • For every time series (effect): 1. Embed it into the state space using time lags. 2. Run all k-NN searches in the state space, producing a lookup table of neighbor indices and weights. 3. Batch predict every other time series (cause) from the table. (Figure: example lookup tables of neighbor indices and normalized weights.)
  13. k-Nearest Neighbor Search Algorithms • Exact methods: full search (Faiss). • Approximate methods: proximity graphs (Faiss, NMSLIB, NGT), locality sensitive hashing (FALCONN), space partitioning (Annoy, FLANN, ScaNN), product quantization (Faiss, ScaNN). Yusuke Matsui, “Billion-scale Approximate Nearest Neighbor Search”, CVPR 2020 Tutorial.
  14. Choosing the k-NN search algorithm • The data dimension (i.e. embedding dimension) is relatively low (< 20). • The number of points (i.e. length of time series) is < 10,000, so the cost of constructing index structures doesn’t pay off. • The distance matrix fits in CPU/GPU memory (4 bytes x 10,000^2 = 400 MB). • We need an exact solution: it is unknown how approximation affects the CCM result. • We therefore take the brute-force search approach.
  15. CUDA Ecosystem • Programming models: CUDA C/C++/Fortran, OpenMP, OpenACC, SYCL, HIP, Kokkos • Math libraries and parallel primitives: Thrust, CUB, cuBLAS, cuDNN, cuFFT, cuTENSOR, cuSPARSE, cuSOLVER • High-level libraries: ArrayFire, PyTorch, TensorFlow, CuPy
  16. Hierarchical Parallelism • Whole cluster: 537 PFLOPS (432 racks/cluster) • Rack: 1.3 PFLOPS (384 nodes/rack) • Node: 3.38 TFLOPS (48 cores/CPU) • Core: 70 GFLOPS
  17. Inter-node Design • A master node distributes work; workers asynchronously fetch and execute work items (small reads/writes). • Bulk reads/writes go through the parallel file system (Lustre) and an on-demand burst buffer (BeeOND).
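
As a rough sketch of this design (not mpEDM's actual code; it assumes MPI is used for inter-node communication and reduces a work item to an integer index), a master rank can hand out tasks on demand so that faster workers naturally take more of them and load imbalance shrinks:

    #include <mpi.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int num_tasks = 100000;   // e.g. one task per effect time series
        const int DONE = -1;

        if (rank == 0) {                // master: serve task indices on request
            int next = 0;
            for (int finished = 0; finished < size - 1; ) {
                int dummy;
                MPI_Status st;
                MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
                int task = (next < num_tasks) ? next++ : DONE;
                if (task == DONE) finished++;
                MPI_Send(&task, 1, MPI_INT, st.MPI_SOURCE, 0, MPI_COMM_WORLD);
            }
        } else {                        // worker: fetch, compute, repeat
            while (true) {
                int request = 0, task;
                MPI_Send(&request, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
                MPI_Recv(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                if (task == DONE) break;
                // ... run one-to-all CCM for effect time series `task` here ...
            }
        }
        MPI_Finalize();
        return 0;
    }
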
  18. Intra-node Design • Step 1: Embedding. • Step 2: All k-NN search on the GPU, producing lookup tables of neighbor indices and weights. • Step 3: Batch prediction from the lookup tables on the CPU. (Figure: example lookup tables of neighbor indices and weights passed from the GPU k-NN step to the CPU prediction step.)
  19. ArrayFire • A general-purpose library for GPU computing. • Provides a NumPy-like high-level interface. • The backend invokes vendor-optimized libraries (cuBLAS, cuSPARSE, cuSOLVER, cuDNN, etc.). https://arrayfire.com/
  20. mpEDM • Massively parallel EDM designed for GPU-centric supercomputers (i.e. ABCI). • Currently supports Simplex projection and CCM. • GitHub: https://github.com/keichi/mpEDM Wassapon Watanakeesuntorn, Keichi Takahashi, Kohei Ichikawa, Joseph Park, George Sugihara, Ryousei Takano, Jason Haga, Gerald M. Pao, “Massively Parallel Causal Inference of Whole Brain Dynamics at Single Neuron Resolution”, 26th International Conference on Parallel and Distributed Systems (ICPADS 2020), Dec. 2020.
  21. Evaluation: CCM Runtime

Dataset     | # of time steps | # of time series | cppEDM (512 nodes) | mpEDM (1 node) | mpEDM (512 nodes)
Fish1_Normo | 1,450           | 53,053           | 8.5 h              | 1,973 s        | 20 s
Subject6    | 3,780           | 92,538           | 176 h*             | 13,953 s       | 101 s
Subject11   | 8,528           | 101,729          | 1,081 h*           | 39,572 s       | 199 s

Up to 1,530x faster and 7,941x cheaper (8,704 USD → 1.1 USD). * Extrapolated from the Fish1_Normo runtime, not measured. ABCI usage fee is 220 JPY (2 USD) per node-hour: https://abci.ai/en/how_to_use/tariffs.html
  22. Evaluation: Inter-node scalability (Charts: strong scaling performance as absolute runtime [min] and relative speedup versus number of workers, from a single node up to 511 workers, for the CPU and GPU backends.)
  23. Evaluation: Intra-node scalability (Charts: runtime and runtime breakdown into Simplex projection, CCM (k-NN and lookup), and others, for 1K-100K time series at 10K time steps and for 1K time series at 1K-40K time steps.)
  24. Challenges remaining in mpEDM • Using ArrayFire’s k-NN search misses optimization opportunities: for example, the k-NN search could be fused with the embedding to improve efficiency, and k is usually small (< 20) in EDM while ArrayFire’s k-NN is tuned for the general case. • Batch prediction (lookup) is executed on the CPU: we tried to implement it on the GPU using ArrayFire (GFOR, sparse arrays, etc.) but performance was poor. • Performance portability: a CPU version had to be maintained separately for CPU-only platforms.
  25. NVIDIA GPU Hardware Architecture • Each SM (streaming multiprocessor) contains CUDA cores, registers, a scheduler, and an L1 cache; the SMs share an L2 cache and global (HBM) memory. • Approximate bandwidths: L1 cache 14 TB/s, L2 cache 3 TB/s, HBM 800 GB/s.
  26. CUDA Programming: Grids of Blocks of Threads
A kernel is launched as a grid of thread blocks, each consisting of many threads.

    // GPU code (“kernel”)
    __global__ void saxpy(int n, float a, float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    // CPU code: launch a grid of 4096 blocks of 256 threads
    saxpy<<<4096, 256>>>(N, 2.0, d_x, d_y);
  27. Hardware to Software Mapping • A CUDA core executes threads. • An SM executes thread blocks. • The GPU executes grids.
  28. k-Nearest Neighbor Search (Distance Calculation)
Distance matrix: D_{ij} = \sum_{k=0}^{E-1} \left( X(i - k\tau) - X(j - k\tau) \right)^2
• Thread block i computes row i of D; the delay vector of point i is cached in shared memory.
• Thread j in block i computes one element D_{ij}; the time-delayed embedding is performed on the fly.
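
A simplified CUDA sketch of this kernel structure (my own illustration, not the actual kEDM kernel; it accumulates squared Euclidean distances and omits further optimizations):

    // One thread block computes one row i of the distance matrix D; the delay
    // vector of point i is staged in shared memory and each thread computes
    // elements D(i, j) with the embedding done on the fly.
    __global__ void pairwise_distances(const float* X, float* D, int n, int E, int tau) {
        extern __shared__ float xi[];              // delay vector of point i
        const int offset = (E - 1) * tau;          // first time index with a full embedding
        const int n_rows = n - offset;
        const int i = blockIdx.x + offset;

        for (int k = threadIdx.x; k < E; k += blockDim.x)
            xi[k] = X[i - k * tau];
        __syncthreads();

        for (int j = offset + threadIdx.x; j < n; j += blockDim.x) {
            float d = 0.0f;
            for (int k = 0; k < E; k++) {
                float diff = xi[k] - X[j - k * tau];
                d += diff * diff;                  // squared Euclidean distance
            }
            D[(size_t)blockIdx.x * n_rows + (j - offset)] = d;
        }
    }

    // Example launch (one block per row, E floats of shared memory per block):
    // pairwise_distances<<<n - (E - 1) * tau, 256, E * sizeof(float)>>>(d_X, d_D, n, E, tau);
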
  29. k-Nearest Neighbor Search (Top-k Search) • Thread block i searches for the top-k elements in row i of the distance matrix D. • Each thread scans a segment of the row and maintains the top-k elements it has seen. • A leader thread merges the per-thread local top-k elements.
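
A simplified CUDA sketch of this partial top-k scheme (illustrative only; it assumes k ≤ 32, which holds in EDM where k = E + 1 is small, and uses a plain selection loop for the final merge):

    #include <cfloat>

    // One thread block finds the k smallest entries of row i of D. Each thread
    // keeps a sorted local top-k of its strided segment; thread 0 then merges
    // the per-thread candidates gathered in shared memory.
    #define K_MAX 32                               // assumes k <= 32

    __global__ void topk_per_row(const float* D, int n, int k, float* out_dist, int* out_idx) {
        extern __shared__ float smem[];            // blockDim.x * k distances, then indices
        float* cand_d = smem;
        int*   cand_i = (int*)&smem[blockDim.x * k];

        const float* row = D + (size_t)blockIdx.x * n;
        float best_d[K_MAX];
        int   best_i[K_MAX];
        for (int s = 0; s < k; s++) { best_d[s] = FLT_MAX; best_i[s] = -1; }

        for (int j = threadIdx.x; j < n; j += blockDim.x) {
            float d = row[j];
            if (d < best_d[k - 1]) {               // insert into the sorted local list
                int s = k - 1;
                while (s > 0 && best_d[s - 1] > d) {
                    best_d[s] = best_d[s - 1]; best_i[s] = best_i[s - 1]; s--;
                }
                best_d[s] = d; best_i[s] = j;
            }
        }
        for (int s = 0; s < k; s++) {
            cand_d[threadIdx.x * k + s] = best_d[s];
            cand_i[threadIdx.x * k + s] = best_i[s];
        }
        __syncthreads();

        if (threadIdx.x == 0) {                    // leader merges blockDim.x * k candidates
            for (int s = 0; s < k; s++) {
                int argmin = 0;
                for (int c = 1; c < blockDim.x * k; c++)
                    if (cand_d[c] < cand_d[argmin]) argmin = c;
                out_dist[blockIdx.x * k + s] = cand_d[argmin];
                out_idx[blockIdx.x * k + s]  = cand_i[argmin];
                cand_d[argmin] = FLT_MAX;          // exclude and pick the next smallest
            }
        }
    }

    // Example launch: topk_per_row<<<n_rows, 128, 128 * k * (sizeof(float) + sizeof(int))>>>(...);
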
  30. Batch prediction
\hat{Y}_i(j) = \sum_{k=1}^{E+1} w(j, k) \cdot Y_i(t(j, k))
• Thread block i makes predictions for time series i, which is cached in shared memory.
• Thread j predicts time step j of time series i from the precomputed neighbor indices t(j, k) and weights w(j, k).
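
A simplified CUDA sketch of this lookup kernel structure (illustrative only; it assumes one target time series fits in shared memory, which holds for roughly 10K single-precision time steps):

    // One thread block predicts one target time series Y_i, which is staged in
    // shared memory; each thread produces one predicted time step from the
    // precomputed neighbor indices and normalized weights.
    __global__ void lookup(const float* Y,        // num_series x num_steps targets
                           const int*   idx,      // num_pred x k neighbor time indices
                           const float* w,        // num_pred x k normalized weights
                           float* Yhat,           // num_series x num_pred predictions
                           int num_steps, int num_pred, int k) {
        extern __shared__ float yi[];             // one target series in shared memory
        const float* y = Y + (size_t)blockIdx.x * num_steps;
        for (int t = threadIdx.x; t < num_steps; t += blockDim.x) yi[t] = y[t];
        __syncthreads();

        for (int j = threadIdx.x; j < num_pred; j += blockDim.x) {
            float pred = 0.0f;
            for (int s = 0; s < k; s++)
                pred += w[j * k + s] * yi[idx[j * k + s]];
            Yhat[(size_t)blockIdx.x * num_pred + j] = pred;
        }
    }
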
  31. Kokkos • A performance portability framework: the same source code runs on many different architectures. • Supports OpenMP, CUDA, HIP, SYCL and HPX as backends. • Implemented as a C++ template library (no special compiler required). • Rich ecosystem: kokkos-tools (profilers, tracers, debuggers, and adapters to vendor tools) and kokkos-kernels (dense and sparse linear algebra kernels). • In development since 2012, with many production applications.
  32. Kokkos Core Concepts • Data parallel patterns: a parallel pattern (for, reduce, scan), an execution policy (scheduling, thread groups), and a body (functor/lambda). • Views: multi-dimensional arrays that are automatically deallocated (reference counted) and whose layout is architecture-dependent (e.g. View<float**> is row-major on CPUs and column-major on GPUs).

    // Pattern: parallel_for; execution policy: RangePolicy; body: lambda
    parallel_for(RangePolicy<ExecSpace>(0, N), KOKKOS_LAMBDA(int i) {
        y(i) = a * x(i) + y(i);
    });
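
For completeness, a minimal, self-contained Kokkos program showing the pattern/policy/body structure above (an illustration, not a kEDM kernel); the same code runs on whichever backend Kokkos was built with:

    #include <Kokkos_Core.hpp>

    int main(int argc, char** argv) {
        Kokkos::initialize(argc, argv);
        {
            const int N = 1 << 20;
            const float a = 2.0f;
            Kokkos::View<float*> x("x", N), y("y", N);   // allocated in the default memory space

            Kokkos::deep_copy(x, 1.0f);                  // fill the views
            Kokkos::deep_copy(y, 2.0f);

            // Pattern: parallel_for; policy: 0..N on the default execution space; body: lambda.
            Kokkos::parallel_for("saxpy", N, KOKKOS_LAMBDA(const int i) {
                y(i) = a * x(i) + y(i);
            });
            Kokkos::fence();
        }
        Kokkos::finalize();
        return 0;
    }
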
  33. kEDM • EDM based on Kokkos. • Currently supports Simplex projection, CCM, and S-Map. • Documentation: https://kedm.readthedocs.io/en/latest/ • The CPU version can be installed from PyPI ($ pip install kedm); the GPU version currently needs to be installed from source. Keichi Takahashi, Wassapon Watanakeesuntorn, Kohei Ichikawa, Joseph Park, Ryousei Takano, Jason Haga, George Sugihara, Gerald M. Pao, “kEDM: A Performance-portable Implementation of Empirical Dynamical Modeling”, Practice & Experience in Advanced Research Computing (PEARC 2021), Jul. 2021.
  34. Evaluation Environment • Ika@Salk: Intel Xeon Gold 6148 (two sockets), NVIDIA V100, and 384 GiB RAM. • Aori@Salk: AMD EPYC 7742 (two sockets) and 1 TiB RAM.
  35. Evaluation: CCM Runtime for Real-world Datasets

Dataset          | # of time series | # of time steps | kEDM (V100) | mpEDM (V100) | Speedup | kEDM (EPYC) | mpEDM (EPYC) | Speedup
Fish1_Normo_Trim | 154              | 1,600           | 3 s         | 11 s         | 3.67x   | 3 s         | 4 s          | 1.33x
Fly80XY          | 82               | 10,608          | 11 s        | 50 s         | 4.55x   | 22 s        | 30 s         | 1.36x
Genes_MEF        | 45,318           | 96              | 344 s       | 334 s        | 0.97x   | 96 s        | 139 s        | 1.45x
Subject6         | 92,538           | 3,780           | 5,391 s     | 29,579 s     | 5.49x   | 12,145 s    | 11,571 s     | 0.95x
Subject11        | 101,729          | 8,528           | 20,517 s    | 85,217 s     | 4.15x   | 43,812 s    | 38,542 s     | 0.88x
F1               | 8,520            | 29,484          | 11,372 s    | 20,128 s     | 1.77x   | 23,001 s    | 19,950 s     | 0.87x

V100 columns: NVIDIA V100 with Xeon Gold 6148. EPYC columns: AMD EPYC 7742.
  36. Evaluation: Micro Benchmark for k-Nearest Neighbor Search • Measured using an artificial time series with 10K time steps. • EPYC: kEDM slightly (~5%) outperforms mpEDM. • V100: kEDM is 6.4x faster than mpEDM when E = 1, but kEDM’s sorting performance rapidly degrades as E increases. (Charts: AMD EPYC 7742 and NVIDIA V100.)
  37. Evaluation: Micro Benchmark for Lookups • Measured using 100K artificial time series, each having 10K time steps. • EPYC: kEDM is slightly faster than mpEDM. • V100: kEDM outperforms the EPYC by a factor of 3 to 4. (Charts: AMD EPYC 7742 and NVIDIA V100.)
  38. Evaluation: Roofline Analysis of the Pairwise Distance Kernel • Larger E increases the arithmetic intensity (more data reuse). • EPYC: initially bounded by the L3 cache; then hits the L1 and L2 rooflines. • V100: initially bounded by HBM; larger E saturates the load/store units. • V100: vectorized loads can’t be used because the memory alignment depends on the user-supplied parameters E and tau. (Charts: AMD EPYC 7742 and NVIDIA V100 rooflines.)
  39. Evaluation: Roofline Analysis of the Lookup Kernel • Again, larger E increases the arithmetic intensity. • Bounded by L1 and L2 cache bandwidth on the EPYC and the V100, respectively. • The k-NN lookup table fits in the EPYC’s L1 cache (4 MiB) and the V100’s L2 cache (6 MiB). (Charts: AMD EPYC 7742 and NVIDIA V100 rooflines.)
  40. References
1. Keichi Takahashi et al., “kEDM: A Performance-portable Implementation of Empirical Dynamical Modeling”, Practice & Experience in Advanced Research Computing (PEARC 2021), Jul. 2021.
2. Wassapon Watanakeesuntorn et al., “Massively Parallel Causal Inference of Whole Brain Dynamics at Single Neuron Resolution”, 26th International Conference on Parallel and Distributed Systems (ICPADS 2020), Dec. 2020.
3. Joseph Park, Gerald Pao, Cameron Smith and George Sugihara, “Massively Parallel Empirical Dynamic Cross Mapping”, 37th Pacific Rim Applications and Grid Middleware Assembly Workshop (PRAGMA 37), Sep. 2019.
Acknowledgments: Gerald Pao, Wassapon Watanakeesuntorn, Kohei Ichikawa, Joseph Park, Ryousei Takano, Jason Haga, George Sugihara.
Software: kEDM GitHub: https://github.com/keichi/kEDM • kEDM Documentation: https://kedm.readthedocs.io/en/latest/ • mpEDM GitHub: https://github.com/keichi/mpEDM