An MPI Framework for HPC Clusters Deployed with Software-Defined Networking

This talk presents our ongoing work on SDN-enhanced MPI, an MPI framework for HPC clusters equipped with Software-Defined Networking (SDN). The goal of this framework is to improve inter-process communication performance and network utilization by dynamically reconfiguring the interconnect based on the communication pattern of applications. We give an overview of SDN-enhanced MPI and summarize recent achievements, including several accelerated MPI collectives and a network profiling tool for use with our framework.

Keichi Takahashi

March 22, 2018

Transcript

  1. The 27th Workshop on Sustained Simulation Performance (WSSP27)
     An MPI Framework for HPC Clusters Deployed with Software-Defined Networking
     Keichi Takahashi, Khureltulga Dashdavaa, Susumu Date, Yoshiyuki Kido, Shinji Shimojo
     Cybermedia Center, Osaka University
  2. Scale-out of Interconnects
     Interconnects are becoming increasingly large and complex
     ‣ since the number of nodes they have to accommodate is increasing
     ‣ can consume up to 50% of total power [1] and 1/3 of total budget [2] of a cluster
     [Figure: Number of Cores of Top500 Systems, 2002/06 to 2016/06, log scale from 1E+02 to 1E+08; series: 1st, 10th, 100th]
     [1] J. Kim et al., "Flattened Butterfly: A Cost-Efficient Topology for High-Radix Networks," ISCA, vol. 35, no. 2, pp. 126–137, 2007.
     [2] D. Abts et al., "Energy proportional datacenter networks," ACM SIGARCH Comput. Archit. News, vol. 38, no. 3, p. 338, 2010.
  3. Mainstream Design of Interconnects
     Network resources are statically allocated
     ‣ Routing, bandwidth allocation, topology, etc. are fixed
     ‣ Simple and easy to implement
     ‣ Consequently, unaware of the communication pattern of applications
     Over-provisioned
     ‣ Redundant links and bandwidth are provisioned
     ‣ To assure the communication performance of diverse applications with different communication patterns
     ‣ However, over-provisioning is becoming more and more expensive due to the rapid scale-out of clusters
  4. Drawbacks of Static Interconnects
     Mismatch between the inter-process communication pattern of applications and the interconnect
     Drawbacks
     1. Load imbalance among links
     2. Low utilization of network resources
     3. Low inter-node communication performance
  5. Basic Concept of the Framework
     [Figure: the communication pattern of the processes running on the cluster is obtained (profile, trace, static analysis, etc.); an interconnect configuration is planned and optimized; the configuration is applied to the switches as flow tables with columns Source, Dest, …, Instructions (e.g. Flood, Output to Port X, Output to Port Y). Legend: switch, server, process]
  6. Temporal Granularity of Reconfiguration
     Per-job
     ‣ Pros: Relatively simple to implement
     ‣ Cons: Limited effect on applications with time-varying communication patterns
     Per-primitive
     ‣ Pros: Fine-grained control, can support time-varying communication patterns
     ‣ Cons: Potentially high overhead, needs an intricate mechanism to synchronize application execution and interconnect control
     Per-packet (a.k.a. adaptive routing)
     ‣ Pros: Works without prior knowledge of the application
     ‣ Cons: Unable to utilize a global view of the network
  7. Overview
     [Figure: framework overview: cluster communication pattern → profile, trace, static analysis, etc. → plan & optimize → interconnect configuration (flow tables) → apply]
  8. PFProf | MPI Profiler
     An MPI profiler focused on network activity monitoring
     ‣ outputs traffic matrices and other network statistics as a JSON file
     ‣ can detect the underlying communication of collective MPI functions
     ‣ implemented based on PMPI and PERUSE
     [Figure: behavior of MPI_Bcast as seen from applications (ranks 0 to 7) versus the actual communication performed when using the binomial tree algorithm]
     Keichi Takahashi et al., "PFAnalyzer: A Toolset for Analyzing Application-aware Dynamic Interconnects," HPCMASPA 2017
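The gap between what an application sees (one MPI_Bcast call) and the point-to-point traffic underneath can be illustrated with a minimal Python sketch. It enumerates the sends of one common binomial tree formulation and accumulates them into a traffic matrix. The function names and the `tx_bytes` JSON key are hypothetical and are not PFProf's actual schema; the exact tree shape also depends on the MPI implementation.

```python
import json


def binomial_bcast_sends(nprocs, root=0):
    """Return (src, dst) rank pairs for one binomial tree broadcast.

    Assumed formulation: in each round, every rank that already holds the
    data (relative rank < mask) forwards it to relative rank + mask.
    """
    sends = []
    mask = 1
    while mask < nprocs:
        for rel in range(mask):
            peer = rel + mask
            if peer < nprocs:
                sends.append(((rel + root) % nprocs, (peer + root) % nprocs))
        mask <<= 1
    return sends


def traffic_matrix(nprocs, sends, nbytes):
    """Accumulate bytes sent from each source rank to each destination rank."""
    matrix = [[0] * nprocs for _ in range(nprocs)]
    for src, dst in sends:
        matrix[src][dst] += nbytes
    return matrix


if __name__ == "__main__":
    sends = binomial_bcast_sends(8)
    matrix = traffic_matrix(8, sends, nbytes=1024)
    # Hypothetical JSON layout, for illustration only.
    print(json.dumps({"tx_bytes": matrix}))
```

With 8 ranks the broadcast decomposes into 7 point-to-point sends, and only some rank pairs carry traffic, which is the information a path allocator needs.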
  9. Overview
     [Figure: framework overview: cluster communication pattern → profile, trace, static analysis, etc. → plan & optimize → interconnect configuration (flow tables) → apply]
  10. Path Allocation Algorithm
      Objective: Maximize load balance and minimize congestion on links
      ‣ Currently, a simple greedy heuristic is adopted
      ‣ Finding the optimal load balancing of multiple flows is an NP-complete problem (a variation of the multi-commodity flow problem)
      [Figure: flows among nodes 0 to 3 are ordered by traffic volume, then a path is allocated to each flow]
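A greedy heuristic of this kind could look like the following Python sketch. The slide only states that flows are ordered by traffic volume and then allocated paths; the selection rule used here (pick the candidate path whose busiest link is least loaded) and all names are assumptions made for illustration, not the authors' implementation.

```python
from collections import defaultdict


def allocate_paths(flows, candidate_paths):
    """Greedily assign one path per flow.

    flows: dict mapping (src, dst) to traffic volume.
    candidate_paths: dict mapping (src, dst) to a non-empty list of paths,
        each path being a tuple of link identifiers.
    Returns (chosen path per flow, resulting load per link).
    """
    load = defaultdict(int)
    chosen = {}
    # Heaviest flows first, so they get the least congested paths.
    for pair, volume in sorted(flows.items(), key=lambda kv: kv[1], reverse=True):
        # Assumed rule: minimize the load of the busiest link on the path.
        best = min(candidate_paths[pair],
                   key=lambda path: max(load[link] for link in path))
        for link in best:
            load[link] += volume
        chosen[pair] = best
    return chosen, dict(load)


if __name__ == "__main__":
    # Hypothetical topology: two leaf switches joined by two spines, A and B.
    via_a = ("leaf0-A", "A-leaf1")
    via_b = ("leaf0-B", "B-leaf1")
    flows = {(0, 2): 3, (1, 3): 2, (0, 3): 1}
    paths = {pair: [via_a, via_b] for pair in flows}
    chosen, load = allocate_paths(flows, paths)
    print(chosen)
    print(load)
```

In this example the heaviest flow takes spine A and the two lighter flows share spine B, so every link ends up carrying a volume of 3.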
  11. Overview
      [Figure: framework overview: cluster communication pattern → profile, trace, static analysis, etc. → plan & optimize → interconnect configuration (flow tables) → apply]
  12. SDN | Software-Defined Networking
      [Figure: conventional networking (feature, control plane and data plane in one device) versus Software-Defined Networking (apps, northbound API, control plane, southbound API (e.g. OpenFlow), data plane); label: disaggregation]
  13. OpenFlow | Standard Implementation of SDN
      [Figure: an OpenFlow controller (control plane) exchanges OpenFlow messages with the switches (data plane)]
      ‣ Add/Modify/Delete flow entries
      ‣ Inject packets into the data plane
      ‣ Notify flow entry misses
      Flow Table (collection of flow entries)
      Src MAC      Dst MAC      …   Instructions
      aa:aa:aa:…   ff:ff:ff:…       Flood
      bb:bb:bb:…   aa:aa:aa:…       Output to Port X
      aa:aa:aa:…   bb:bb:bb:…       Set Dst IP to Y,…
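The flow table idea can be sketched as a toy Python model: entries match on source and destination MAC address and carry instructions, and a lookup that matches nothing is a miss that a switch could report to the controller. This is an illustration only, not the OpenFlow wire protocol or any controller's API; the class names, instruction strings and MAC addresses are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FlowEntry:
    src_mac: Optional[str]  # None acts as a wildcard
    dst_mac: Optional[str]  # None acts as a wildcard
    instructions: str


class FlowTable:
    """Toy flow table: the first matching entry wins."""

    def __init__(self) -> None:
        self.entries: List[FlowEntry] = []

    def add(self, entry: FlowEntry) -> None:
        self.entries.append(entry)

    def delete(self, src_mac: Optional[str], dst_mac: Optional[str]) -> None:
        self.entries = [e for e in self.entries
                        if (e.src_mac, e.dst_mac) != (src_mac, dst_mac)]

    def lookup(self, src_mac: str, dst_mac: str) -> Optional[str]:
        for entry in self.entries:
            if (entry.src_mac in (None, src_mac)
                    and entry.dst_mac in (None, dst_mac)):
                return entry.instructions
        return None  # miss: a real switch could notify the controller here


# Hypothetical addresses, chosen only for this example.
HOST_A = "02:00:00:00:00:0a"
HOST_B = "02:00:00:00:00:0b"
BROADCAST = "ff:ff:ff:ff:ff:ff"

table = FlowTable()
table.add(FlowEntry(HOST_A, BROADCAST, "flood"))
table.add(FlowEntry(HOST_B, HOST_A, "output:1"))
table.add(FlowEntry(HOST_A, HOST_B, "output:2"))

if __name__ == "__main__":
    print(table.lookup(HOST_A, HOST_B))
    print(table.lookup(HOST_B, BROADCAST))
```

Reconfiguring the interconnect then amounts to the controller adding, modifying and deleting such entries on each switch.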
  14. Overview
      [Figure: framework overview: cluster communication pattern → profile, trace, static analysis, etc. → plan & optimize → interconnect configuration (flow tables) → apply (OpenFlow)]
  15. Putting Them All Together
      Integration of the OpenFlow controller and the job scheduler
      ‣ OpenFlow Controller: oversees communication (reconfigures the interconnect)
      ‣ Job Scheduler: oversees computation (starts, monitors and terminates jobs)
  16. Overall Architecture
      [Figure: architecture diagram. Nodes: Head Node, Compute Node, Interconnect Manager Node, Interconnect. Components: slurmctld, slurmdbd, slurmd, Job Info Plugin, Process Placement Plugin, Interconnect Manager, Communication Pattern DB, OpenFlow Controller, PFProf, job app. Labels: Launch (sbatch, srun, …), Job Info, Flow Entries. Legend: User Developed Components, Existing Components]
  17. Preliminary Evaluation
      Evaluated the execution time of a communication-intensive benchmark with and without our framework
      ‣ Cluster of 20 compute nodes, each equipped with 1 CPU (8 cores)
      ‣ Interconnect is a 2-level fat-tree with a 2.5:1 oversubscription ratio
      ‣ NAS CG benchmark with 128 processes was used as the workload
      ‣ Different node selection and process placement strategies
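Two common process placement strategies, block and cyclic, can be sketched in Python as follows. The definitions are the conventional ones (block fills a node before moving to the next, cyclic distributes ranks round-robin); whether the evaluated strategies match them exactly is an assumption, as is the choice of 16 of the 20 nodes in the example.

```python
from collections import Counter


def block_placement(nprocs, nodes, slots_per_node):
    """Fill each node with consecutive ranks before moving to the next node."""
    if nprocs > len(nodes) * slots_per_node:
        raise ValueError("not enough slots for the requested processes")
    return [nodes[rank // slots_per_node] for rank in range(nprocs)]


def cyclic_placement(nprocs, nodes, slots_per_node):
    """Distribute ranks over the nodes in round-robin order."""
    if nprocs > len(nodes) * slots_per_node:
        raise ValueError("not enough slots for the requested processes")
    return [nodes[rank % len(nodes)] for rank in range(nprocs)]


if __name__ == "__main__":
    # Assumption: 128 processes on 16 nodes with 8 cores each.
    nodes = list(range(16))
    block = block_placement(128, nodes, 8)
    cyclic = cyclic_placement(128, nodes, 8)
    print(block[:10])   # ranks 0 to 7 share node 0, rank 8 starts node 1
    print(cyclic[:10])  # neighboring ranks land on different nodes
    print(Counter(cyclic))
```

The two placements use the same nodes but put different rank pairs on the same node, so the same application produces a different inter-node traffic matrix under each.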
  18. Evaluation Results
      ‣ SDN achieves consistently better communication performance than static routing (D-mod-K)
      ‣ Process placement and node selection have a significant effect on the performance
      [Figure: Communication Time [s] (axis 0 to 700) by process allocation (Block A, Block B, Block C, Block D, Block E, Cyclic) for SDN and D-mod-K; annotation: 26% reduction]
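Why a static scheme such as D-mod-K can leave links unevenly loaded is shown by the simplified Python model below. It assumes a 2-level fat-tree in which a leaf switch picks the uplink for inter-leaf traffic as destination modulo the number of uplinks. The topology parameters (10 nodes per leaf, 4 uplinks, which would give a 2.5:1 ratio) and the flows are hypothetical and are not taken from the evaluated cluster.

```python
from collections import Counter


def dmodk_uplink(dst_node, num_uplinks):
    """Static uplink choice: depends only on the destination node."""
    return dst_node % num_uplinks


def uplink_load(flows, num_uplinks, nodes_per_leaf):
    """Sum traffic volume per (source leaf, uplink) for inter-leaf flows.

    flows: iterable of (src_node, dst_node, volume).
    """
    load = Counter()
    for src, dst, volume in flows:
        src_leaf = src // nodes_per_leaf
        dst_leaf = dst // nodes_per_leaf
        if src_leaf != dst_leaf:  # intra-leaf traffic uses no uplink
            load[(src_leaf, dmodk_uplink(dst, num_uplinks))] += volume
    return load


if __name__ == "__main__":
    # Hypothetical flows: two heavy flows whose destinations collide mod 4.
    flows = [(0, 12, 5), (1, 16, 5), (2, 13, 1), (3, 4, 9)]
    print(uplink_load(flows, num_uplinks=4, nodes_per_leaf=10))
```

In this example both heavy flows are mapped to uplink 0 of leaf 0 while uplinks 2 and 3 stay idle; a traffic-aware allocation could instead spread them across uplinks.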
  19. Conclusion
      Summary
      ‣ A framework that dynamically reconfigures the interconnect to match the communication pattern of MPI applications is proposed
      ‣ The proposed framework integrates the interconnect controller into the job scheduler
      ‣ Evaluation indicates improvement in communication performance
      Future Directions
      ‣ Extensive benchmark evaluation using diverse applications
      ‣ Combine application-aware process placement and node selection
      ‣ Adopt sophisticated path allocation algorithms