Agenda 2
Why do we need an MPI Application-aware Interconnect?
SDN-accelerated MPI Primitives
‣ Is our idea feasible at all?
A Coordination Mechanism of Computation and Communication
‣ How do we reconfigure the interconnect in accordance with the execution of applications?
A Toolset for Analyzing Application-aware Dynamic Interconnects
‣ How will the proposed architecture perform with various types of applications and clusters?
Future Interconnects 3
Over-provisioned designs might not scale well
‣ Interconnects can consume up to 50% of the total power [1] and one third of the total budget of a cluster [2]
‣ Properties such as full bisection bandwidth and non-blocking operation may become increasingly difficult to achieve
‣ Need to improve the utilization of the interconnect
Our proposal is to adopt:
‣ Dynamic (adaptive) routing
‣ Application-aware network control
[1] J. Kim et al., "Flattened Butterfly: A Cost-Efficient Topology for High-Radix Networks," ISCA, vol. 35, no. 2, pp. 126–137, 2007.
[2] D. Abts et al., "Energy Proportional Datacenter Networks," ACM SIGARCH Comput. Archit. News, vol. 38, no. 3, p. 338, 2010.
Inefficiency in Current Interconnect 4
A mismatch between the communication pattern of applications and the topology of the interconnect [3] leads to lower utilization, higher congestion, and lower communication performance.
[3] S. Kamil et al., "Communication Requirements and Interconnect Optimization for High-End Scientific Applications," IEEE Trans. Parallel Distrib. Syst., 2010.
SDN-enhanced MPI Framework 5
Our prototype framework integrates SDN into MPI:
‣ Dynamically controls the interconnect based on the communication pattern of MPI applications
‣ Uses Software-Defined Networking (SDN) as a key technology to realize dynamic interconnect control (e.g. dynamic routing)
‣ Successfully accelerated several MPI primitives (e.g. MPI_Bcast and MPI_Allreduce)
Software-Defined Networking 6
In conventional networking, features, the control plane, and the data plane are integrated into each device. Software-Defined Networking disaggregates them: applications talk to the control plane through a northbound API, and the control plane programs the data plane through a southbound API (e.g. OpenFlow).
Standard Implementation of SDN 7
Each switch in the data plane holds a flow table, a collection of flow entries. The OpenFlow controller in the control plane exchanges OpenFlow messages with the data plane to add/modify/delete flow entries and to inject packets into the data plane, while switches notify the controller of flow entry misses.
Example flow table:
Src MAC      Dst MAC      …  Instructions
aa:aa:aa:…   ff:ff:ff:…   …  Flood
bb:bb:bb:…   aa:aa:aa:…   …  Output to Port X
aa:aa:aa:…   bb:bb:bb:…   …  Set Dst IP to Y, …
SDN-enhanced InfiniBand (Lee et al., SC16)
‣ Enhancement to InfiniBand that allows dynamic, per-flow network control
Conditional OpenFlow (Benito et al., HiPC 2015)
‣ Enhanced OpenFlow that allows users to add flow entries that are activated when an Ethernet Pause (IEEE 802.3x) occurs
‣ Primary goal is to implement non-minimal adaptive routing on Ethernet
Quantized Congestion Notification Switch (Benito et al., HiPINEB 2017)
‣ Another enhancement to OpenFlow that uses received QCNs (IEEE 802.1Qau Quantized Congestion Notification) to probabilistically determine which path to select
9
Agenda 10
Why do we need an MPI Application-aware Interconnect?
SDN-accelerated MPI Primitives
‣ Can we accelerate MPI primitives based on our idea?
A Coordination Mechanism of Computation and Communication
‣ How do we reconfigure the interconnect in accordance with the execution of applications?
A Toolset for Analyzing Application-aware Dynamic Interconnects
‣ How will our idea perform on various types of applications and clusters?
Agenda 12
Why do we need an MPI Application-aware Interconnect?
SDN-accelerated MPI Primitives
‣ Is our idea feasible at all?
A Coordination Mechanism of Computation and Communication
‣ How do we reconfigure the interconnect in accordance with the execution of applications?
A Toolset for Analyzing Application-aware Dynamic Interconnects
‣ How will our idea perform on various types of applications and clusters?
Coordination of Computation and Communication 13
#include <mpi.h>
int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);
  MPI_Bcast(buf, count, …);
  /* … */
  MPI_Allreduce(buf, count, …);
  MPI_Finalize();
}
As the application executes, its communication pattern varies over time, and each phase calls for a matching reconfiguration of the interconnect (updated flow entries). How do we synchronize these two?
UnisonFlow 14
Embed an encoded MPI envelope into each packet
‣ MPI envelope: rank, primitive type, communicator, etc.
‣ Current implementation uses virtual MAC addresses to represent tags
‣ The MPI library passes the envelope to a custom kernel module via ioctl; the kernel module embeds a tag into each outgoing MPI packet, and packet flow is controlled based on the tag value (e.g. tag A → output to port X, tag B → output to port Y)
[5] Keichi Takahashi, "Concept and Design of SDN-enhanced MPI Framework," EWSDN 2015.
Agenda 15
Why do we need an MPI Application-aware Interconnect?
SDN-accelerated MPI Primitives
‣ Is our idea feasible at all?
A Coordination Mechanism of Computation and Communication
‣ How do we reconfigure the interconnect in accordance with the execution of applications?
A Toolset for Analyzing Application-aware Dynamic Interconnects
‣ How will our idea perform on various types of applications and clusters?
Diversity of the Communication Pattern 17
How does the traffic load in the interconnect change for diverse applications?
‣ What kind of application benefits most from SDN-enhanced MPI?
‣ What happens if the number of processes scales out?
Diversity of the Cluster Configuration 18
How does the traffic load in the interconnect change under diverse clusters with different configurations?
‣ How do job scheduling, node selection and process mapping affect the performance of applications?
‣ How does the topology of the interconnect impact the performance?
‣ What happens if the size of the cluster scales out?
Figure: three decisions made when running jobs on a cluster
‣ Job Scheduling (i.e. which job should be executed next?)
‣ Node Selection (i.e. which nodes should be allocated to a given job?)
‣ Process Placement (i.e. which node should execute a process?)
Requirements for the Interconnect Analysis Toolset 19
We aim to develop a toolset to help answer these questions:
‣ How does the traffic load in the interconnect change for diverse applications?
‣ How does the traffic load in the interconnect change under diverse clusters with different configurations?
A simulator-based approach is taken to allow rapid assessment.
The requirements for the toolset are summarized as:
1. Support for application-aware dynamic routing
2. Support for communication patterns of real-world applications
3. Support for diverse cluster configurations
Existing profilers do not capture the underlying point-to-point communication of collective communication 21
‣ They are designed to support code tuning and optimization, not network traffic analysis
‣ The MPI Profiling Interface (PMPI) only captures individual MPI function calls
Figure: the behavior of MPI_Bcast as seen from applications (rank 0 sends to all other ranks) vs. the actual point-to-point communication performed (a tree among ranks 0–7)
PFProf 22
The MPI Performance Revealing Extension Interface (PERUSE) is utilized
‣ PERUSE exposes internal information of the MPI library
‣ Notifies you when a request is posted/completed, a transfer begins/ends, etc.
PFProf hooks MPI functions (MPI_Init, MPI_Finalize, MPI_Comm_create, MPI_Comm_dup, MPI_Comm_free) with PMPI and subscribes to PERUSE events; the application calls MPI functions as usual, and the MPI library notifies PFProf of the PERUSE events.
Communication Pattern 23
The communication pattern of an application is represented using its traffic matrix
‣ Defined as a matrix T whose element Tij is equal to the volume of traffic sent from rank i to rank j
‣ Assumes that the volume of traffic between processes is constant during the execution of a job
Figure: an example traffic matrix (sent bytes per sender/receiver rank pair) obtained from running the NERSC MILC benchmark with 128 processes
Simulation-based study of large-scale clusters with different topologies
‣ Currently, our institution owns only a small-scale experimental cluster equipped with SDN
Integrate the interconnect controller with the scheduler and MPI runtime
‣ To support multiple jobs running in parallel
‣ To investigate the effect of node allocation and process placement
Better application-aware routing algorithms
‣ Currently, a simple greedy-like algorithm is used
‣ How about optimization or machine learning?
30
Static and over-provisioned interconnects might not scale well
‣ SDN allows us to build a more dynamic and application-aware interconnect
‣ Such an architecture could improve the utilization of the interconnect and the communication performance
Our achievements so far include:
‣ SDN-accelerated MPI primitives such as Bcast and Allreduce
‣ UnisonFlow, a coordination mechanism of computation and communication
‣ PFAnalyzer, a toolset for analyzing application-aware dynamic interconnects
31