‣ Interconnects can consume up to 50% of the total power [1] and one third of the total budget of a cluster [2]
‣ Properties such as full bisection bandwidth and non-blocking operation may become increasingly difficult to achieve
‣ Need to improve the utilization of the interconnect

Our proposal is to adopt:
‣ Dynamic (adaptive) routing
‣ Application-aware network control

[1] J. Kim et al., "Flattened Butterfly: A Cost-Efficient Topology for High-Radix Networks," ISCA, vol. 35, no. 2, pp. 126–137, 2007.
[2] D. Abts et al., "Energy Proportional Datacenter Networks," ACM SIGARCH Comput. Archit. News, vol. 38, no. 3, p. 338, 2010.
[3] S. Kamil, L. Oliker, A. Pinar, and J. Shalf, "Communication Requirements and Interconnect Optimization for High-End Scientific Applications," IEEE Trans. Parallel Distrib. Syst., vol. 21, no. 2, pp. 188–202, 2010.
SDN-enhanced MPI [4,5,6]
‣ Dynamically controls the interconnect based on the communication pattern of MPI applications
‣ Uses Software-Defined Networking (SDN) as a key technology to realize dynamic interconnect control (e.g. dynamic routing)
‣ Successfully accelerated several MPI primitives (e.g. MPI_Bcast, MPI_Allreduce)

[4] K. Takahashi et al., "Performance Evaluation of SDN-enhanced MPI_Allreduce on a Cluster System with Fat-tree Interconnect," HPCS 2014.
[5] B. Munkhdorj et al., "Design and Implementation of Control Sequence Generator for SDN-enhanced MPI," NDM'15.
[6] S. Date et al., "SDN-accelerated HPC Infrastructure for Scientific Research," IJIT, vol. 22, no. 01, 2016.
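The core idea of application-aware dynamic routing can be illustrated with a minimal sketch. The topology, path sets, traffic volumes, and the greedy least-loaded-path heuristic below are all illustrative assumptions, not the actual SDN-enhanced MPI controller logic:

```python
# Hypothetical sketch of application-aware dynamic routing: a controller
# assigns each flow to the candidate path whose most-loaded link carries
# the least traffic so far (greedy load balancing). The topology, paths,
# and flow volumes are made-up examples.

def route_flows(flows, candidate_paths):
    """flows: list of (src, dst, volume); candidate_paths maps
    (src, dst) to a list of paths, each path a list of link names."""
    link_load = {}
    assignment = {}
    for src, dst, volume in sorted(flows, key=lambda f: -f[2]):
        # Pick the path whose heaviest link is currently the lightest.
        best = min(candidate_paths[(src, dst)],
                   key=lambda p: max(link_load.get(l, 0) for l in p))
        for link in best:
            link_load[link] = link_load.get(link, 0) + volume
        assignment[(src, dst)] = best
    return assignment, link_load

# Two flows contend for the same pair of uplinks.
paths = {("n0", "n2"): [["s0-u0"], ["s0-u1"]],
         ("n1", "n3"): [["s0-u0"], ["s0-u1"]]}
flows = [("n0", "n2", 100), ("n1", "n3", 80)]
assignment, load = route_flows(flows, paths)
print(assignment)  # the two flows end up on different uplinks
```

With a static routing function both flows could share one uplink; routing with knowledge of the traffic pattern spreads them across both.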
‣ How does the traffic load in the interconnect change for diverse applications?
‣ What kind of application benefits most from SDN-enhanced MPI?
‣ What happens if the number of processes scales out?
‣ How does the traffic load in the interconnect change under diverse clusters with different configurations?
‣ How do job scheduling, node selection, and process mapping affect the performance of applications?
‣ How does the topology of the interconnect impact the performance?
‣ What happens if the size of the cluster scales out?

[Diagram: jobs j1–j4 queued for execution on nodes 0–3]
‣ Job Scheduling (i.e. which job should be executed next?)
‣ Node Selection (i.e. which node should be allocated to a given job?)
‣ Process Placement (i.e. which node should execute a process?)
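Why process placement matters for interconnect load can be shown with a small sketch (an illustrative example, not taken from the paper): ranks communicating in a ring are mapped to two nodes either in blocks or round-robin, and only messages between different nodes traverse the interconnect:

```python
# Illustrative example: the same communication pattern generates very
# different interconnect traffic depending on the rank-to-node mapping.

def internode_traffic(traffic, placement):
    """traffic[i][j] = bytes sent from rank i to rank j;
    placement[i] = node hosting rank i."""
    return sum(traffic[i][j]
               for i in range(len(traffic))
               for j in range(len(traffic))
               if placement[i] != placement[j])

NRANKS, VOL = 8, 1
# Ring pattern: each rank sends VOL bytes to the next rank.
traffic = [[VOL if j == (i + 1) % NRANKS else 0 for j in range(NRANKS)]
           for i in range(NRANKS)]
block = [i // 4 for i in range(NRANKS)]        # ranks 0-3 on node 0, 4-7 on node 1
round_robin = [i % 2 for i in range(NRANKS)]   # alternating nodes

print(internode_traffic(traffic, block))        # 2: only the "seam" messages cross
print(internode_traffic(traffic, round_robin))  # 8: every message crosses nodes
```

The block mapping sends only 2 messages over the interconnect, the round-robin mapping all 8, a 4x difference in link load for an identical application.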
Research questions:
‣ How does the traffic load in the interconnect change for diverse applications?
‣ How does the traffic load in the interconnect change under diverse clusters with different configurations?

A simulator-based approach is taken to allow rapid assessment.

Requirements for the Interconnect Analysis Toolset
‣ Requirements for the toolset are summarized as:
1. Support for application-aware dynamic routing
2. Support for communication patterns of real-world applications
3. Support for diverse cluster configurations
ORCS [7]
‣ Simulates the congestion on each link in the interconnect for a given topology, communication pattern and routing algorithm
INAM2 [8]
‣ Comprehensive tool to monitor and analyze network activities in an InfiniBand network
PSINS [9]
‣ Trace-driven simulator for predicting the performance of applications on a variety of HPC clusters with different configurations

[7] T. Schneider et al., "ORCS: An Oblivious Routing Congestion Simulator," Indiana University Technical Report, no. 675, 2009.
[8] H. Subramoni et al., "INAM2: InfiniBand Network Analysis and Monitoring with MPI," ISC 2016, pp. 300–320.
[9] M. M. Tikir et al., "PSINS: An Open Source Event Tracer and Execution Simulator," HPCMP-UGC 2009, pp. 444–449.
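What a routing congestion simulator of this kind computes can be sketched in a few lines. This is a toy model in the spirit of such tools, not ORCS itself; the two-level tree topology and shift pattern are made-up examples:

```python
# Minimal sketch of routing congestion simulation: apply a fixed routing
# function to every message of a communication pattern and accumulate
# the per-link load. Topology: 4 hosts, 2 leaf switches, 1 spine (toy).

def simulate(pattern, route):
    """pattern: list of (src, dst) messages; route(src, dst) -> links."""
    load = {}
    for src, dst in pattern:
        for link in route(src, dst):
            load[link] = load.get(link, 0) + 1
    return load

def leaf(h):
    return h // 2  # hosts 0,1 under leaf L0; hosts 2,3 under leaf L1

def route(src, dst):
    if leaf(src) == leaf(dst):               # stays below one leaf switch
        return [(src, "L%d" % leaf(src)), ("L%d" % leaf(dst), dst)]
    return [(src, "L%d" % leaf(src)),        # up to the spine, then down
            ("L%d" % leaf(src), "S"), ("S", "L%d" % leaf(dst)),
            ("L%d" % leaf(dst), dst)]

shift = [(i, (i + 2) % 4) for i in range(4)]  # shift-by-2 pattern
load = simulate(shift, route)
print(max(load.values()))  # 2: the leaf-spine links are the hot-spot
```

Every message of the shift pattern crosses the spine, so each leaf-spine link carries two messages while each host link carries one, exactly the kind of hot-spot such a simulator is meant to reveal.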
‣ Existing profilers cannot capture the underlying point-to-point communication of collective communication
‣ They are designed to support code tuning and optimization, not network traffic analysis
‣ The MPI Profiling Interface (PMPI) only captures individual MPI function calls

[Figure: the behavior of MPI_Bcast as seen from applications (rank 0 appears to send directly to ranks 1–7) versus the actual communication performed (a tree of point-to-point messages among ranks 0–7)]
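The gap between the application-level view and the actual traffic can be made concrete by enumerating the sends of a binomial-tree broadcast, one common MPI_Bcast algorithm (real MPI libraries choose among several algorithms depending on message size and rank count; this sketch fixes the root at rank 0):

```python
# Sketch of the point-to-point messages behind a binomial-tree broadcast
# rooted at rank 0. The application only sees one MPI_Bcast call, but
# the network carries a tree of sends spread across log2(nranks) steps.

def binomial_bcast_edges(nranks):
    edges = []
    mask = 1
    while mask < nranks:
        for src in range(nranks):
            # A rank with src < mask already holds the data at this step
            # and forwards it to its partner rank (src XOR mask).
            partner = src ^ mask
            if src < mask and partner < nranks:
                edges.append((src, partner))
        mask <<= 1
    return edges

print(binomial_bcast_edges(8))
# [(0, 1), (0, 2), (1, 3), (0, 4), (1, 5), (2, 6), (3, 7)]
```

A PMPI wrapper sees only the single MPI_Bcast call at each rank; the seven point-to-point transfers above, which are what actually load the links, stay invisible to it.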
‣ PERUSE exposes internal information of the MPI library
‣ Notifies you when a request is posted/completed, a transfer begins/ends, etc.

[Diagram: PFProf hooks MPI functions (MPI_Init, MPI_Finalize, MPI_Comm_create, MPI_Comm_dup, MPI_Comm_free) with PMPI and subscribes to PERUSE events; the MPI application calls MPI functions; the MPI library notifies PFProf of PERUSE events]
The communication pattern of an application is represented using its traffic matrix, of which element Tij is equal to the volume of traffic sent from rank i to rank j
‣ Implies that the volume of traffic between processes is treated as constant during the execution of a job

[Figure: an example traffic matrix obtained from running the NERSC MILC benchmark with 128 processes; axes show sender rank and receiver rank, color indicates sent bytes (×10^8)]
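Building such a matrix from profiled transfers amounts to accumulating every observed message into one cell. A minimal sketch (the event records are made-up examples, not actual PFProf output):

```python
# Sketch of traffic-matrix construction: each observed transfer
# (src rank, dst rank, bytes) is accumulated into T[src][dst].

def build_traffic_matrix(events, nranks):
    T = [[0] * nranks for _ in range(nranks)]
    for src, dst, nbytes in events:
        T[src][dst] += nbytes
    return T

events = [(0, 1, 4096), (1, 0, 4096), (0, 1, 1024), (2, 3, 65536)]
T = build_traffic_matrix(events, 4)
print(T[0][1])  # 5120: the two rank-0 -> rank-1 transfers accumulate
```

Note how the accumulation is what drops the time axis: T records only totals, which is exactly the approximation discussed later under fidelity.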
[Figure: simulated maximum traffic load (normalized) and execution time on the actual cluster, comparing DmodK and dynamic routing. For the NAS CG benchmark (128 ranks), dynamic routing reduces the simulated maximum traffic by 50% and the execution time by 18%; for the NERSC MILC benchmark (128 ranks), the reductions are 23% and 8%, respectively.]
Online optimization of the interconnect
‣ Can be used for application-aware scheduling and process allocation
Improved fidelity
‣ Currently, false-positive hot-spots may be identified due to a rough approximation (time-axis information is dropped)
‣ Segment the profile into multiple distinct communication phases and simulate each phase separately
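The proposed phase segmentation could proceed along these lines. The gap-based splitting criterion below is an assumption for illustration; the slides only state that the profile is segmented into phases that are simulated separately:

```python
# Sketch of phase segmentation: instead of one traffic matrix for the
# whole run, split the timestamped message log into communication
# phases (here: wherever the idle gap between consecutive events
# exceeds a threshold) and build one matrix per phase.

def split_into_phases(events, gap):
    """events: list of (time, src, dst, nbytes), sorted by time."""
    phases, current = [], [events[0]]
    for prev, ev in zip(events, events[1:]):
        if ev[0] - prev[0] > gap:   # long silence: a new phase begins
            phases.append(current)
            current = []
        current.append(ev)
    phases.append(current)
    return phases

log = [(0.0, 0, 1, 10), (0.1, 1, 2, 10),   # phase 1
       (5.0, 2, 3, 10), (5.2, 3, 0, 10)]   # phase 2, after a long gap
print(len(split_into_phases(log, gap=1.0)))  # 2
```

Simulating each phase's matrix separately avoids superimposing traffic that never coexists in time, which is the source of the false-positive hot-spots.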
dynamic interconnects
‣ A tool to rapidly test different interconnect control algorithms is required for the research on SDN-enhanced MPI
‣ Our proposal: PFAnalyzer
- PFProf: collects the communication pattern from applications using the MPI PERUSE interface
- PFSim: simulates the interconnect in a holistic manner using the communication pattern acquired with PFProf
‣ Preliminary results were obtained that agree with benchmark results on actual clusters
‣ Future plan: integrate into SDN-enhanced MPI for online simulation