
Performance analysis of mdx II: A next-generation cloud platform for cross-disciplinary data science research

Keichi Takahashi, Tomonori Hayami, Yu Mukaizono, Yuki Teramae, Susumu Date, "Performance analysis of mdx II: A next-generation cloud platform for cross-disciplinary data science research," 15th International Conference on Cloud Computing and Services Science (CLOSER 2025), Apr. 2025.

Keichi Takahashi

April 01, 2025

Transcript

  1. 15th International Conference on Cloud Computing and Services Science (CLOSER 2025)
     Performance analysis of mdx II: A next-generation cloud platform for cross-disciplinary data science research
     Keichi Takahashi, Tomonori Hayami, Yu Mukaizono, Yuki Teramae, Susumu Date
     D3 Center, The University of Osaka, Japan
  2. Shortcomings of traditional HPC systems
     • Traditional high-performance computing systems face various challenges when dealing with recent data science and machine learning workloads:
       ◦ Limited hardware and software customizability. ("I want to install software A, but it requires root." "My analysis tool B runs on Windows only.")
       ◦ Lack of support for long-running jobs.
       ◦ Limited connectivity with outside data sources and storage. ("I want to publish my dataset along with a web frontend." "I want to stream data from experimental facilities.")
  3. The mdx cloud platform
     • A cloud platform jointly procured and operated by nine Japanese universities and institutes, aiming at:
       ◦ Creating flexible computing environments
       ◦ Creating secure isolated virtual environments
       ◦ Enabling seamless coupling with data sources
       ◦ Accommodating real-time/urgent jobs
     • The first implementation, mdx I, was deployed at the University of Tokyo in 2021. (Figure: overview of mdx I [1])
     [1] T. Suzumura et al., "mdx: A Cloud Platform for Supporting Data Science and Cross-Disciplinary Research Collaborations," DASC/PiCom/CBDCom/CyberSciTech, 2022.
  4. mdx II
     • The second implementation of the mdx concept, aimed at introducing the latest hardware and enabling geo-redundancy and fault tolerance.
     • Deployed at the University of Osaka in 2024; entered service in Nov. 2024.
     • Based on OpenStack, but integrates various components such as a parallel file system, object storage, a file hosting interface, and federated authentication.
  5. Contributions of this study
     • We detail the design and implementation of the mdx II cloud platform as an example of a state-of-the-art academic cloud.
     • We conduct a comprehensive performance evaluation of mdx II to offer quantitative performance data and reveal its strengths and weaknesses.
     • Through performance analysis, we identify performance bottlenecks and discuss design trade-offs to provide insights for future academic cloud designs.
  6. Architecture of mdx II
     [Diagram: 60 CPU nodes (2x CPU, RAM, one NIC each) and 7 GPU nodes (2x CPU, RAM, 4x GPU, two NICs each) interconnected via 200 Gbps Ethernet; storage comprises 553 TB Lustre and 432 TB object storage, exposed through NFS, S3DS, and Nextcloud; management servers; external connectivity to the Internet via SINET6.]
  7. Node architecture
                  CPU nodes                                 GPU nodes *
      CPU         Intel Xeon Platinum 8480+ (56 cores) x2   Intel Xeon Gold 6530 (32 cores) x2
      Memory      512 GiB (DDR5-4800 SDRAM)                 1024 GiB (DDR5-5600 SDRAM)
      GPU         N/A                                       NVIDIA H200 SXM5 x4
      Network     200 Gbps Ethernet x1                      200 Gbps Ethernet x2
      # of nodes  60                                        7
     Form factor: CPU nodes are packed 20 nodes/8U; each GPU node is 2U.
     * Under deployment as of now and not included in this study.
  8. Storage design
      Tier        Interface          Capacity  Purpose
      Block       POSIX              100 TB    VM local storage
      Lustre      POSIX              453 TB    High-performance and parallel I/O
      S3DS        S3-compatible API  —         High-performance I/O
      HyperStore  S3-compatible API  432 TB    Archival, ingestion, and publishing
     [Diagram: the guest sees /dev/vda (virtio-blk via the host), /lustre, and /nfs (an NFS server in front of Lustre); the S3DS server provides S3 on EXAScaler (Lustre), while HyperStore provides S3 on HyperStore.]
  9. Evaluation setup
     • Evaluations are carried out using a 16-vCPU VM and a 224-vCPU VM.
     • 16-vCPU: comparison with public cloud (floating-point computing performance, memory throughput, network throughput, storage performance, application performance).
     • 224-vCPU: comparison with bare metal (floating-point computing performance, memory throughput, application performance).

                    mdx II 16-vCPU VM             AWS c7i.4xlarge
      CPU           16 vCPUs (Intel Xeon 8480+)   16 vCPUs (Intel Xeon 8488C)
      Memory        32 GiB                        32 GiB
      Network B/W   No traffic shaping            12.5 Gbps

                    mdx II 224-vCPU VM            Kyoto University Laurel 3
      CPU           224 vCPUs (Intel Xeon 8480+)  Intel Xeon 8480+ x2
      Memory        498 GiB                       512 GiB
      Network B/W   200 Gbps                      200 Gbps
  10. Compute and memory performance (16-vCPU)
     • HPL (included in the Intel oneAPI HPC Toolkit [1]) is used to measure floating-point computing performance, and BabelStream [2] to measure memory throughput.
     • Floating-point compute performance is 2x that of AWS c7i.4xlarge.
       ◦ AWS pins each vCPU to a logical core, while mdx II does not.
     • Memory throughput is 1.7x higher than that of AWS.
       ◦ Likely also due to weaker isolation than AWS.

                    Compute      Memory
      mdx II       1344 GFLOPS   164 GB/s
      c7i.4xlarge   656 GFLOPS    97 GB/s

     [1] https://www.intel.com/content/www/us/en/developer/tools/oneapi/hpc-toolkit.html
     [2] https://github.com/UoB-HPC/BabelStream
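The slides name the tools but not the invocations. As a rough sketch (build options follow BabelStream's README at the time of writing; the study's exact configuration is not stated), the OpenMP variant of BabelStream could be built and run as:

```shell
# Build the OpenMP backend of BabelStream and run it with 16 threads
# to match the 16-vCPU VM. Flags are illustrative, not the paper's.
git clone https://github.com/UoB-HPC/BabelStream
cd BabelStream
cmake -Bbuild -H. -DMODEL=omp
cmake --build build
OMP_NUM_THREADS=16 ./build/omp-stream
```

BabelStream reports copy/mul/add/triad/dot bandwidth in MB/s; the triad figure is the one usually quoted as "memory throughput".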
  11. Network throughput
     • iperf 3.18 is used to measure the aggregate TCP throughput between two VMs.
     • The zero-copy (-Z) option is enabled to reduce kernel-user memory copies.
     • Throughput does not scale with the number of streams.
     [Figure: total throughput vs. number of TCP streams (1-16) with the default configuration, plateauing at 16.7 Gbps (inter-node) and 31.7 Gbps (intra-node). Diagrams show iperf -c and iperf -s guests on two hosts (inter-node test) or the same host (intra-node test), connected by N TCP streams.]
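The measurement setup described above can be sketched with standard iperf3 invocations (stream count and duration here are illustrative, not the paper's exact parameters):

```shell
# On the server VM:
iperf3 -s

# On the client VM: N parallel TCP streams with zero-copy sends.
# -Z enables sendfile()-based zero-copy, -P sets the stream count.
iperf3 -c <server-ip> -Z -P 8 -t 30
```

The aggregate throughput across all streams is printed in the summary line at the end of the run.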
  12. virtio-net (with vhost-net)
     • virtio-net is the paravirtualized network driver used by default in OpenStack.
     • virtio-net and vhost-net communicate I/O through a shared ring buffer in memory, reducing context switches and improving performance.
     • virtio multiqueue creates multiple queues between virtio-net and vhost-net, parallelizing packet transfers between guest and host.
     [Diagram: guest application → virtio-net in the guest kernel → shared ring → vhost-net in the host kernel (alongside KVM), bypassing QEMU user space on the data path.]
  13. Network throughput with tuning
     • Enabling virtio multiqueue allows the throughput to scale with the number of concurrent TCP streams.
     • The round-trip latency increases by up to 15%, which is considered a reasonable trade-off.
     [Figures: (left) round-trip latency (µs) with and without multiqueue, intra- and inter-node, showing the +15% increase; (right) total throughput vs. number of streams (1-16), reaching 93 Gbps and 80 Gbps.]
  14. Aggregate network throughput between nodes
     • The total throughput between multiple pairs of VMs is measured (16 streams per VM).
     • Total throughput between two nodes saturates at 126 Gbps (physical bandwidth is 200 Gbps).
       ◦ Performance is likely limited by the various overheads of virtual networking.
       ◦ GPU nodes will have SR-IOV enabled.
     [Figure: total throughput vs. number of VM pairs (1-4), intra- and inter-node; inter-node throughput saturates at 126 Gbps.]
  15. Storage I/O performance
     • fio [1] 3.38 is used to measure sequential and random I/O performance (see paper for detailed configurations).
     • mdx II block storage achieves better performance than AWS, but is limited by the Lustre-NFS gateway and virtio.
     • Lustre offers even higher performance, and is bottlenecked by the guest network performance.

      Sequential I/O   Read       Write
      mdx II (Block)   4.21 GB/s  1.75 GB/s
      mdx II (Lustre)  9.82 GB/s  7.74 GB/s
      AWS (Block)      1.05 GB/s  1.05 GB/s

      Random I/O       Read       Write
      mdx II (Block)    61 KIOPS   21 KIOPS
      mdx II (Lustre)  416 KIOPS  164 KIOPS
      AWS (Block)       16 KIOPS   16 KIOPS

     [1] https://github.com/axboe/fio
  16. Lustre performance
     • The IOR [1] benchmark is used to measure the total performance of Lustre when accessed from multiple VMs.
     • Total throughput saturates at 15 GB/s, while the designed peak throughput is ~50 GB/s.
     • Performance could be limited by TCP/IP; achieving peak performance might require RDMA (e.g., RoCE).
     [Figure: read and write throughput (GiB/s) vs. number of VMs (1-8), saturating around 15 GiB/s.]
     [1] https://github.com/hpc/ior
  17. Cloudian HyperStore
     • warp [1] 1.0.8 is used to measure the throughput of Cloudian HyperStore.
     • The benchmark uploads (PUT) or downloads (GET) 2500 objects, each 10 MiB in size.
     • Throughput saturates at 1.12 GB/s.
       ◦ The link bandwidth between HyperStore and mdx II is limited to 10 Gbps.
       ◦ HyperStore should not be used for performance-critical workloads.
     [Figure: GET and PUT throughput (MiB/s) vs. concurrency (1-256).]
     [1] https://github.com/minio/warp
  18. S3 Data Services
     • warp is used to measure the throughput of Lustre accessed via S3DS.
     • Single-client throughput is higher than that of Cloudian HyperStore.
     • With enough concurrency (128), S3DS can saturate the guest network throughput.
     • PUT exhibits poor scalability and performance compared to GET (a 7.6x gap), which needs further investigation.
     [Figure: GET and PUT throughput (MiB/s) vs. concurrency (1-256), showing the 7.6x gap between GET and PUT.]
  19. Polars Decision Support (PDS) Benchmarks
     • PDS is a port of the TPC-H OLAP benchmark to various Python libraries.
     • Some performance differences between mdx II and AWS exist, but the differences between libraries/frameworks are larger.
     • Evaluation at larger scales and performance profiling are future work.
     [Figures: runtimes (s) of queries 1-7 on mdx II and EC2 for Pandas, Polars, DuckDB, and Dask.]
  20. Compute and memory performance (224-vCPU)
     • No measurable overhead in computing performance.
     • Up to 22% decrease in memory throughput.
     • Potential reasons behind the overhead:
       ◦ Cross-NUMA traffic (vCPUs are not pinned and thus can move between sockets).
       ◦ Address translation (huge pages are not used in mdx II).

                  Compute       Memory
      mdx II      4965 GFLOPS   383 GB/s
      Bare metal  4819 GFLOPS   490 GB/s
  21. SPEChpc 2021
     • SPEChpc [1] is an HPC application benchmark suite developed by the SPEC group.
     • The "tiny" suite, designed for a single node, is used to compare mdx II and bare metal.
     • Memory-intensive benchmarks perform worse compared to bare metal.
     [Figure: speedup over the baseline system for LBM, SOMA, TeaLeaf, CloverLeaf, miniSweep, POT3D, SPH-EXA, HPGMG-FV, and miniWeather on bare metal and mdx II.]
     [1] https://www.spec.org/hpc2021/
  22. Discussion
     • VM performance and system utilization is a trade-off.
       ◦ e.g., vCPU pinning and huge pages improve performance but disable overcommitting of cores and memory, respectively.
       ◦ Since the current design leans towards higher utilization, performance needs to be continuously studied as the system becomes more congested.
     • VM network performance is critical.
       ◦ virtio multiqueue is an effective optimization, but TCP/IP and network virtualization still impose large overheads that prevent fully utilizing the host bandwidth (200 Gbps).
       ◦ We will investigate optimizations such as SR-IOV and RoCE, but they break or limit compatibility with mdx I and public clouds (e.g., inability to migrate VMs).
  23. Conclusions and Future Work
     • An extensive performance evaluation and analysis of mdx II was carried out.
     • The results demonstrated that mdx II offers equal or better performance than its competitors, especially in storage I/O performance.
     • Several weak points, such as network and memory performance, were identified, and potential optimizations were discussed.
     • Future work includes further evaluation using real-world data science workloads, platform performance optimization, and evaluation of the H200 GPU nodes.
     This work was partially supported by JST ACT-X Grant Number JPMJAX24M6, as well as JSPS KAKENHI Grant Numbers JP20K19808 and JP23K16890. The mdx II system was used to carry out experiments.