
PFAnalyzer: A Toolset for Analyzing Application-aware Dynamic Interconnects

My talk at HPCMASPA 2017 (held in conjunction with IEEE Cluster 2017)

Keichi Takahashi

September 05, 2017

Transcript

  1. PFAnalyzer: A Toolset for Analyzing Application-aware Dynamic Interconnects Keichi Takahashi,

    Susumu Date, Dashdavaa Khureltulga, Yoshiyuki Kido, Shinji Shimojo Cybermedia Center, Osaka University
  2. PFAnalyzer: A Toolset for Analyzing Application-aware Dynamic Interconnects Keichi Takahashi,

    Susumu Date, Dashdavaa Khureltulga, Yoshiyuki Kido, Shinji Shimojo Cybermedia Center, Osaka University
  3. Challenges in Future Interconnects Over-provisioned designs might not scale well

    ‣ Interconnects can consume up to 50% of the total power [1] and 1/3 of the total budget of a cluster [2] ‣ Properties such as full bisection bandwidth and non-blocking operation may become increasingly difficult to achieve ‣ Need to improve the utilization of the interconnect Our proposal is to adopt: ‣ Dynamic (adaptive) routing ‣ Application-aware network control 2 [1] J. Kim et al., “Flattened Butterfly: A Cost-Efficient Topology for High-Radix Networks,” ISCA, vol. 35, no. 2, pp. 126–137, 2007. [2] D. Abts et al., “Energy proportional datacenter networks,” ACM SIGARCH Comput. Archit. News, vol. 38, no. 3, p. 338, 2010. [3] S. Kamil, L. Oliker, A. Pinar, and J. Shalf, “Communication Requirements and Interconnect Optimization for High-End Scientific Applications,” IEEE Trans. Parallel Distrib. Syst., vol. 21, no. 2, pp. 188–202, 2010.
  4. SDN-enhanced MPI Framework Our prototype framework that integrates SDN into

    MPI [4,5,6] ‣ Dynamically controls the interconnect based on the communication pattern of MPI applications ‣ Uses Software-Defined Networking (SDN) as a key technology to realize dynamic interconnect control (e.g. dynamic routing) ‣ Successfully accelerated several MPI primitives (e.g. MPI_Bcast, MPI_Allreduce) 3 [4] K. Takahashi et al., “Performance Evaluation of SDN-enhanced MPI_Allreduce on a Cluster System with Fat-tree Interconnect,” HPCS 2014. [5] B. Munkhdorj et al., “Design and Implementation of Control Sequence Generator for SDN-enhanced MPI,” NDM ’15. [6] S. Date et al., “SDN-accelerated HPC Infrastructure for Scientific Research,” IJIT, vol. 22, no. 01, 2016.
  5. Software-Defined Networking (SDN) 4

    [Diagram: in conventional networking each device bundles the control plane and data plane; SDN disaggregates them, with applications driving the control plane through a northbound API and the control plane programming the data plane through a southbound API (e.g. OpenFlow)]
  6. Basic Idea of SDN-enhanced MPI 5 Interconnect Computing Nodes 1

    2 3 0 Communication Pattern 0 1 2 3 … Interconnect Control Seq.
  7. Basic Idea of SDN-enhanced MPI 5 Interconnect Computing Nodes 1

    2 3 0 Communication Pattern 0 1 2 3 … Interconnect Control Seq. Extract via Tracer/Profiler
 Static Analysis
  8. Basic Idea of SDN-enhanced MPI 5 Interconnect Computing Nodes 1

    2 3 0 Communication Pattern 0 1 2 3 … Interconnect Control Seq. Extract via Tracer/Profiler
 Static Analysis Resource (Path, Bandwidth, etc.) Allocation
  9. Basic Idea of SDN-enhanced MPI 5 Interconnect Computing Nodes 1

    2 3 0 Communication Pattern 0 1 2 3 … Interconnect Control Seq. Apply using OpenFlow Extract via Tracer/Profiler
 Static Analysis Resource (Path, Bandwidth, etc.) Allocation
  10. Need for a Holistic Analysis in SDN-enhanced MPI 6 2

    4 5 1 0 3 Job Queue j1 j2 j3 j4 Job Scheduling
  11. Need for a Holistic Analysis in SDN-enhanced MPI 6 2

    4 5 1 0 3 Job Queue j1 j2 j3 j4 Communication Pattern Job Scheduling
  12. Need for a Holistic Analysis in SDN-enhanced MPI 6 2

    4 5 1 0 3 Job Queue j1 j2 j3 j4 Communication Pattern Job Scheduling
  13. Need for a Holistic Analysis in SDN-enhanced MPI 6 2

    4 5 1 0 3 Job Queue j1 j2 j3 j4 Communication Pattern Job Scheduling Node Selection
  14. PEs PEs PEs Need for a Holistic Analysis in SDN-enhanced

    MPI 6 2 4 5 1 0 3 Job Queue j1 j2 j3 j4 Communication Pattern Job Scheduling Node Selection
  15. PEs PEs PEs Need for a Holistic Analysis in SDN-enhanced

    MPI 6 0 1 2 3 4 5 2 4 5 1 0 3 Job Queue j1 j2 j3 j4 Communication Pattern Job Scheduling Node Selection
  16. PEs PEs PEs Need for a Holistic Analysis in SDN-enhanced

    MPI 6 0 1 2 3 4 5 2 4 5 1 0 3 Job Queue j1 j2 j3 j4 Communication Pattern Job Scheduling Node Selection Process Mapping
  17. PEs PEs PEs Need for a Holistic Analysis in SDN-enhanced

    MPI 6 0 1 2 3 4 5 2 4 5 1 0 3 Job Queue j1 j2 j3 j4 Communication Pattern Job Scheduling Node Selection Process Mapping
  18. PEs PEs PEs Need for a Holistic Analysis in SDN-enhanced

    MPI 6 0 1 2 3 4 5 2 4 5 1 0 3 Job Queue j1 j2 j3 j4 Communication Pattern Job Scheduling Node Selection Process Mapping
  19. PEs PEs PEs Need for a Holistic Analysis in SDN-enhanced

    MPI 6 0 1 2 3 4 5 2 4 5 1 0 3 Job Queue j1 j2 j3 j4 Communication Pattern Job Scheduling Node Selection Process Mapping
  20. PEs PEs PEs Need for a Holistic Analysis in SDN-enhanced

    MPI 6 0 1 2 3 4 5 2 4 5 1 0 3 Job Queue j1 j2 j3 j4 Communication Pattern Job Scheduling Node Selection Process Mapping Routing
  21. Q1: Impact of the Communication Pattern How does the traffic

    load in the interconnect change for diverse applications? ‣ What kind of application benefits most from SDN-enhanced MPI? ‣ What happens if the number of processes scales out? 7
  22. Q2: Impact of the Cluster Configuration How does the traffic

    load in the interconnect change under diverse clusters with different configurations? ‣ How do job scheduling, node selection and process mapping affect the performance of applications? ‣ How does the topology of the interconnect impact the performance? ‣ What happens if the size of the cluster scales out? 8 [Diagram: Job Scheduling (i.e. which job should be executed next?), Node Selection (i.e. which node should be allocated to a given job?), Process Placement (i.e. which node should execute a process?)]
  23. We aim to develop a toolset to help answer these

    questions ‣ How does the traffic load in the interconnect change for diverse applications? ‣ How does the traffic load in the interconnect change under diverse clusters with different configurations? A simulator-based approach is taken to allow rapid assessment ‣ Requirements for the toolset are summarized as: Requirements for the Interconnect Analysis Toolset 9 1. Support for application-aware dynamic routing 2. Support for communication patterns of real-world applications 3. Support for diverse cluster configurations
  24. Related Work ORCS [7] ‣ Simulates the traffic load of

    each link in the interconnect for a given topology, communication pattern and routing algorithm INAM2 [8] ‣ Comprehensive tool to monitor and analyze network activities in an InfiniBand network PSINS [9] ‣ Trace-driven simulator for predicting the performance of applications on a variety of HPC clusters with different configurations 10 [7] T. Schneider et al., “ORCS: An Oblivious Routing Congestion Simulator,” Indiana University, Computer, no. 675, 2009. [8] H. Subramoni et al., “INAM2: InfiniBand Network Analysis and Monitoring with MPI,” ISC 2016, pp. 300–320. [9] M. M. Tikir et al., “PSINS: An Open Source Event Tracer and Execution Simulator,” HPCMP-UGC 2009, pp. 444–449.
  25. PFProf (profiler) and PFSim (simulator) constitute PFAnalyzer ‣ PFProf -

    Fine-grained MPI profiler for observing the network activity caused by MPI function calls (Requirement 2) ‣ PFSim - Lightweight simulator of the traffic load in the interconnect, targeting application-aware dynamic interconnects
 (Requirement 1, 2, 3) Overview of PFAnalyzer 11 [Diagram: an Application is profiled by PFProf; the resulting Profile is fed to PFSim, which produces the simulation Result]
  26. PFProf: Motivation Existing profilers do not capture the underlying pt2pt

    communication of collective communication ‣ They are designed to support code tuning and optimization, not network traffic analysis. ‣ MPI Profiling Interface (PMPI) only captures individual MPI function calls. 12 [Diagram: the behavior of MPI_Bcast as seen from applications (rank 0 broadcasting to ranks 1-7) vs. the actual point-to-point communication performed inside the library]
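    To make the gap concrete, the following sketch enumerates the point-to-point sends behind an 8-rank MPI_Bcast rooted at rank 0, under the assumption of a binomial-tree algorithm (a common choice, though an MPI library may pick a different one): the application sees a single collective call, while the network carries several rounds of individual messages.

        # Sketch: point-to-point sends behind MPI_Bcast, assuming a binomial tree.
        # Rank 0 is the root; ranks forward the data as soon as they have received it.
        def binomial_bcast_sends(num_ranks):
            sends, step = [], num_ranks // 2
            while step >= 1:
                for src in range(0, num_ranks, 2 * step):
                    dst = src + step
                    if dst < num_ranks:
                        sends.append((src, dst))  # src already holds the data
                step //= 2
            return sends

        print(binomial_bcast_sends(8))
        # [(0, 4), (0, 2), (4, 6), (0, 1), (2, 3), (4, 5), (6, 7)]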
  27. PFProf: Implementation MPI Performance Revealing Extension Interface (PERUSE) is utilized

    ‣ PERUSE exposes internal information of MPI library ‣ Notifies you when a request is posted/completed, a transfer begins/ends, etc. 13 PFProf MPI Application MPI Library • MPI_Init • MPI_Finalize • MPI_Comm_create • MPI_Comm_dup • MPI_Comm_free
  28. PFProf: Implementation MPI Performance Revealing Extension Interface (PERUSE) is utilized

    ‣ PERUSE exposes internal information of MPI library ‣ Notifies you when a request is posted/completed, a transfer begins/ends, etc. 13 PFProf MPI Application MPI Library • MPI_Init • MPI_Finalize • MPI_Comm_create • MPI_Comm_dup • MPI_Comm_free Subscribe to PERUSE Events
  29. PFProf: Implementation MPI Performance Revealing Extension Interface (PERUSE) is utilized

    ‣ PERUSE exposes internal information of MPI library ‣ Notifies you when a request is posted/completed, a transfer begins/ends, etc. 13 PFProf MPI Application MPI Library • MPI_Init • MPI_Finalize • MPI_Comm_create • MPI_Comm_dup • MPI_Comm_free Call MPI Functions Subscribe to PERUSE Events
  30. PFProf: Implementation MPI Performance Revealing Extension Interface (PERUSE) is utilized

    ‣ PERUSE exposes internal information of MPI library ‣ Notifies you when a request is posted/completed, a transfer begins/ends, etc. 13 PFProf MPI Application MPI Library • MPI_Init • MPI_Finalize • MPI_Comm_create • MPI_Comm_dup • MPI_Comm_free Call MPI Functions Notify PERUSE Events Subscribe to PERUSE Events
  31. PFProf: Implementation MPI Performance Revealing Extension Interface (PERUSE) is utilized

    ‣ PERUSE exposes internal information of MPI library ‣ Notifies you when a request is posted/completed, a transfer begins/ends, etc. 13 PFProf MPI Application MPI Library • MPI_Init • MPI_Finalize • MPI_Comm_create • MPI_Comm_dup • MPI_Comm_free Call MPI Functions Notify PERUSE Events Subscribe to PERUSE Events Hook MPI Functions with PMPI
  32. Representation of Communication Pattern ‣ Defined as a matrix T

    whose element Tij is equal to the volume of traffic sent from rank i to rank j ‣ Implies that the volume of traffic between each pair of processes is treated as constant during the execution of a job 14 The communication pattern of an application is represented using its traffic matrix [Heatmap: sender rank vs. receiver rank, sent bytes ×10^8; an example obtained from running the NERSC MILC benchmark with 128 processes]
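    Assuming the profile is available as a list of (source rank, destination rank, bytes) records (this record format is illustrative, not necessarily PFProf's output format), a traffic matrix of this kind can be accumulated as follows.

        # Sketch: build a traffic matrix T where T[i][j] is the number of bytes
        # that rank i sent to rank j, aggregated over the whole run.
        import numpy as np

        def traffic_matrix(records, num_ranks):
            """records: iterable of (src_rank, dst_rank, num_bytes) tuples."""
            t = np.zeros((num_ranks, num_ranks))
            for src, dst, nbytes in records:
                t[src, dst] += nbytes
            return t

        # Toy example: rank 0 sends 1 MB to ranks 1 and 2, rank 1 sends 2 MB to rank 3.
        print(traffic_matrix([(0, 1, 1e6), (0, 2, 1e6), (1, 3, 2e6)], num_ranks=4))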
  33. PFProf: Overhead Evaluation 15

    [Plots: throughput (osu_bw) and latency (osu_latency) versus message size [B], measured with and without the profiler] Measured throughput and latency of pt2pt communication with and without PFProf using the OSU Microbenchmark
  34. PFSim: Overview 16

    [Diagram: PFSim takes a Cluster Configuration, a Cluster Topology, Communication Patterns (collected with PFProf), and a Simulation Scenario as input, and produces the Interconnect Usage, a Performance Metric Plot, and a Simulation Log as output; Scheduling, Node Selection, Process Placement, and Routing are plugins]
  35. PFSim: Overview 16

    [Same diagram as slide 34, with a "For Requirement 2" annotation added (support for communication patterns of real-world applications)]
  36. PFSim: Overview 16

    [Same diagram as slide 35, with two "For Requirement 3" annotations added (support for diverse cluster configurations)]
  37. PFSim: Overview 16

    [Same diagram as slide 36, now additionally annotated "For Requirement 1" (support for application-aware dynamic routing), alongside the "For Requirement 2" and "For Requirement 3" annotations]
  38. PFSim: Architecture 17

    [Diagram: events are held in an event queue and dispatched to event handlers (Job Submitted, Job Started, Job Finished, …), which update the simulator state: the job queue (j1-j4), the computing nodes, and the interconnect]
  39. PFSim: Architecture 17

    [Same diagram as slide 38; the event handlers are customized via plugins]
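    A minimal sketch of such an event-driven core, assuming illustrative event and class names rather than PFSim's actual ones: events sit in a time-ordered queue, and the handlers registered for each event type (the plugin hook) are invoked as events are dispatched.

        # Sketch: time-ordered event queue with pluggable handlers.
        import heapq

        class Simulator:
            def __init__(self):
                self.clock, self.queue, self.handlers = 0.0, [], {}

            def subscribe(self, event_type, handler):        # plugin hook
                self.handlers.setdefault(event_type, []).append(handler)

            def schedule(self, time, event_type, payload=None):
                heapq.heappush(self.queue, (time, event_type, payload))

            def run(self):
                while self.queue:
                    self.clock, event_type, payload = heapq.heappop(self.queue)
                    for handler in self.handlers.get(event_type, []):
                        handler(self, payload)                # handlers update the state

        sim = Simulator()
        sim.subscribe("job_submitted", lambda s, job: s.schedule(s.clock + 1.0, "job_started", job))
        sim.subscribe("job_started", lambda s, job: print(f"{s.clock}: {job} started"))
        sim.schedule(0.0, "job_submitted", "j1")
        sim.run()   # prints "1.0: j1 started"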
  40. PFSim: Example Input & Output 18

    Cluster Configuration (YAML):

        topology: topologies/milk.graphml
        output: output/milk-cg-dmodk
        algorithms:
          scheduler:
            - pfsim.scheduler.FCFSScheduler
          node_selector:
            - pfsim.node_selector.LinearNodeSelector
            - pfsim.node_selector.RandomNodeSelector
          process_mapper:
            - pfsim.process_mapper.LinearProcessMapper
            - pfsim.process_mapper.CyclicProcessMapper
          router:
            - pfsim.router.DmodKRouter
            - pfsim.router.GreedyRouter
            - pfsim.router.GreedyRouter2
        jobs:
          - submit:
              distribution: pfsim.math.ExponentialDistribution
              params:
                lambd: 0.1
            trace: traces/cg-c-128.tar.gz

    Interconnect Utilization: [Output GraphML visualized with Cytoscape]
  41. PFSim: Example Input & Output 18

    [Same cluster configuration (YAML) and interconnect utilization visualization as slide 40, with a region of high traffic load highlighted]
  42. PFSim: Example Input & Output 18

    [Same as slide 41, with an additional region of less traffic load highlighted]
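    The dotted names in the algorithms section suggest that each algorithm is pluggable by class path. A minimal sketch of how such names could be resolved (an assumption about the mechanism, not necessarily how PFSim implements it):

        # Sketch: resolve a dotted plugin name such as "pfsim.router.DmodKRouter"
        # into the corresponding class object.
        import importlib

        def load_plugin(dotted_name):
            module_name, _, class_name = dotted_name.rpartition(".")
            return getattr(importlib.import_module(module_name), class_name)

        # Demonstrated with a standard-library class, since pfsim may not be installed:
        print(load_plugin("collections.OrderedDict"))   # <class 'collections.OrderedDict'>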
  43. Simulated Cluster ‣ Modeled after a cluster installed at our

    institution ‣ 20 computing nodes (160 cores) ‣ 2-level fat-tree topology (oversubscription ratio = 2.5) ‣ Switch (NEC PF5240) supports OpenFlow 1.0 (and 1.3) ‣ NAS CG and NERSC MILC are used as workloads 19 [Diagram: spine switches and leaf switches connecting the computing nodes in a 2-level fat-tree]
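    As a rough illustration (not the actual cluster description), a 2-level fat-tree of this size could be written out as a GraphML topology file of the kind PFSim reads (see slide 40) using networkx; the parameter values below are chosen only to match the slide's node count and oversubscription ratio, and the attribute names are made up.

        # Sketch: generate a 2-level fat-tree topology and save it as GraphML.
        import networkx as nx

        SPINES, LEAVES, HOSTS_PER_LEAF = 2, 4, 5    # 20 hosts; 5 downlinks / 2 uplinks = 2.5

        g = nx.Graph()
        for s in range(SPINES):
            g.add_node(f"spine{s}", kind="switch")
        for lf in range(LEAVES):
            g.add_node(f"leaf{lf}", kind="switch")
            for s in range(SPINES):                 # every leaf uplinks to every spine
                g.add_edge(f"leaf{lf}", f"spine{s}", capacity=1.0)
            for h in range(HOSTS_PER_LEAF):         # hosts hang off their leaf switch
                host = f"host{lf * HOSTS_PER_LEAF + h}"
                g.add_node(host, kind="host")
                g.add_edge(host, f"leaf{lf}", capacity=1.0)

        nx.write_graphml(g, "fat_tree_20.graphml")  # file name is illustrative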
  44. Simulated Configurations 20

    ‣ Node Selection: Linear, Random ‣ Process Placement: Linear (block), Cyclic ‣ Routing: D-mod-K (path selected solely based on the destination of the flow), Dynamic (path allocated based on the communication pattern, heavy pairs first) [Illustrations: block vs. cyclic placement of ranks 0-5; a traffic-matrix heatmap (sender rank vs. receiver rank, sent bytes ×10^8)]
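    The two routing policies can be contrasted with a small sketch (assumptions only, not PFSim's routers): on a 2-level fat-tree, D-mod-K picks the spine switch from the destination id alone, while the dynamic policy walks the flows from heaviest to lightest and sends each one over the currently least-loaded spine; the maximum link load is the indicator plotted on the next slide.

        # Sketch: D-mod-K vs. load-aware ("dynamic") spine selection on a 2-level
        # fat-tree with 2 spines and 5 hosts per leaf, as in the fat-tree sketch above.
        from collections import defaultdict

        SPINES, HOSTS_PER_LEAF = 2, 5
        leaf = lambda host: host // HOSTS_PER_LEAF

        def max_uplink_load(flows, choose_spine):
            """flows: list of ((src, dst), bytes); returns the hottest leaf-spine link load."""
            load = defaultdict(float)                       # (leaf, spine) -> bytes
            for (src, dst), vol in flows:
                if leaf(src) == leaf(dst):                  # same leaf: no spine traversal
                    continue
                s = choose_spine(src, dst, load)
                load[(leaf(src), s)] += vol                 # up-link at the source leaf
                load[(leaf(dst), s)] += vol                 # down-link at the destination leaf
            return max(load.values())

        dmodk = lambda src, dst, load: dst % SPINES         # destination-only path selection

        def dynamic(src, dst, load):                        # least-loaded spine for this pair
            return min(range(SPINES),
                       key=lambda s: load[(leaf(src), s)] + load[(leaf(dst), s)])

        traffic = {(0, 6): 8e8, (1, 8): 8e8, (2, 12): 8e8}  # toy traffic matrix entries
        heavy_first = sorted(traffic.items(), key=lambda kv: -kv[1])
        print(max_uplink_load(heavy_first, dmodk), max_uplink_load(heavy_first, dynamic))
        # 2400000000.0 1600000000.0  (the dynamic policy spreads the hot leaf's flows)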
  45. Simulation Results Maximum traffic load on all links is plotted

    as a performance indicator 21 [Bar charts: maximum traffic (normalized) under every combination of node selection (Linear, Random), process placement (Block, Cyclic), and routing (D-mod-K, Dynamic), for the NAS CG benchmark (128 ranks) and the NERSC MILC benchmark (128 ranks)]
  46. Comparison with Benchmark Results 22

    [Bar charts: simulated maximum traffic load (normalized) and execution time [s] on the actual cluster, for D-mod-K vs. Dynamic routing, for the NAS CG benchmark (128 ranks) and the NERSC MILC benchmark (128 ranks); the annotated differences between D-mod-K and Dynamic are 50%, 18%, 23%, and 8%]
  47. Future Work Integration into SDN-enhanced MPI framework ‣ To realize

    online optimization of the interconnect ‣ Can be used for application-aware scheduling and process allocation Improved fidelity ‣ Currently, false-positive hot spots may be identified because of a rough approximation (time-axis information is dropped) ‣ Segment the profile into multiple distinct communication phases and simulate each phase separately 23
  48. Conclusion ‣ SDN-enhanced MPI is an embodiment of future application-aware

    dynamic interconnects ‣ A tool to rapidly test different interconnect control algorithms is required for research on SDN-enhanced MPI ‣ Our proposal: PFAnalyzer - PFProf: Collects the communication pattern from applications using the MPI PERUSE interface - PFSim: Simulates the interconnect in a holistic manner using the communication pattern acquired with PFProf ‣ Preliminary results are obtained that conform to benchmark results on actual clusters ‣ Future plan: integrate into SDN-enhanced MPI for online simulation 24