The primary management tools are NVML and nvidia-smi.
NVML: NVIDIA Management Library
— Query state and configure GPUs
— C, Perl, and Python APIs
nvidia-smi: command-line client for NVML
GPU Deployment Kit: includes NVML headers, docs, and nvidia-healthmon
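As a quick illustration, both the GPU inventory and the full per-GPU state are available from the command line (a minimal sketch; the exact fields reported vary with driver version):
  # List GPUs with their UUIDs
  nvidia-smi -L
  # Full state report for GPU 0 (temperature, power, clocks, ECC, ...)
  nvidia-smi -q -i 0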
A key system design consideration is PCIe topology.
— Direct memory access between devices with P2P transfers
— Unified addressing for system and GPU memory
— Works best when all devices are on the same PCIe root or switch
[Diagram: a single unified address space (0x0000 to 0xFFFF) spanning system memory, GPU0 memory, and GPU1 memory]
P2P on dual-socket servers:
— P2P communication is supported between GPUs on the same IOH, each on an x16 link
— The socket-to-socket interconnect is incompatible with the PCIe P2P specification, so P2P does not work between GPUs attached to different CPUs
[Diagram: CPU0 with GPU0/GPU1 and CPU1 with GPU2/GPU3, each GPU on an x16 PCIe link]
If apps use P2P heavily:
— More GPUs per node are better
— Choose servers with an appropriate PCIe topology
— Tune the application to do transfers within a PCIe complex
If apps don't use P2P:
— Performance may be dominated by host <-> device data transfers
— Prefer more servers with fewer GPUs per server
[Diagram: GPU0 and GPU1 on PCIe Switch 0; GPU2 and another device (e.g. a NIC) on Switch 1; all on x16 links]
• PCIe switches fully supported for all operations
• Best P2P performance between devices on the same switch
• P2P also supported with other devices such as a NIC via GPUDirect RDMA
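To see how GPUs, switches, and NICs actually hang off the PCIe tree on a given node, stock Linux tools are enough (a sketch; 10de is NVIDIA's PCI vendor ID):
  # Show the PCIe device tree; switches appear as bridges in the hierarchy
  lspci -tv | less
  # List only NVIDIA devices with their bus IDs
  lspci -d 10de:
  # Recent drivers can also print a GPU/NIC affinity matrix (availability varies):
  # nvidia-smi topo -m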
GPUDirect RDMA moves data directly between the NIC and the GPU without touching host memory.
• Greatly improved performance
• Currently supported on Cray (XK7 and XC-30) and Mellanox FDR InfiniBand
• Some MPI implementations support GPUDirect RDMA
Many servers ship with 64-bit PCIe addressing turned off
— Needs to be turned on for Tesla K40 or for systems with many GPUs
— Might be called "Enable 4G Decoding" or similar
Configure cooling for passive GPUs
— Tesla M-series has passive cooling: it relies on system fans
— The GPU communicates thermals to the BMC so it can manage fan speed
— Make sure BMC firmware is up to date and fans are configured correctly
Make sure the remote console uses the onboard VGA, not the "offboard" NVIDIA GPU
The nouveau driver will conflict with the NVIDIA driver.
Two steps to disable it:
1. Create /etc/modprobe.d/disable-nouveau.conf containing:
     blacklist nouveau
     options nouveau modeset=0
2. Rebuild the initial ramdisk:
     RHEL: dracut --force
     SUSE: mkinitrd
     Deb: update-initramfs -u
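After the next reboot you can confirm that nouveau stayed out of the way (a quick sanity check with standard tools):
  # Should print nothing if nouveau is blacklisted correctly
  lsmod | grep nouveau
  # Once the NVIDIA driver is installed, its module should be loaded instead
  lsmod | grep ^nvidia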
Command-line installer
— Bundled with the CUDA toolkit: developer.nvidia.com/cuda
— Stand-alone: www.nvidia.com/drivers
RPM/DEB packages
— Provided by NVIDIA (major versions only)
— Provided by Linux distros (on their own release schedule)
It is not easy to switch between these methods.
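For unattended provisioning, the stand-alone runfile can be driven non-interactively (a sketch; <version> is a placeholder, and you should confirm the flags with --help on the installer you actually download):
  # Run with X stopped / at runlevel 3
  sh ./NVIDIA-Linux-x86_64-<version>.run --silent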
Compute nodes typically run headless at runlevel 3, so you should initialize the GPUs explicitly in an init script.
At minimum:
— Load the kernel modules: nvidia + nvidia_uvm (in CUDA 6)
— Create the device nodes with mknod
Optional steps:
— Configure compute mode
— Set driver persistence
— Set power limits
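A minimal sketch of such an init script, following the pattern in the driver README (assumptions: character-device major 195, the nvidia-uvm device node omitted, and the optional settings shown with example values):
  #!/bin/sh
  # Load kernel modules (nvidia_uvm ships with CUDA 6 / driver 331+)
  /sbin/modprobe nvidia
  /sbin/modprobe nvidia_uvm
  # Count NVIDIA GPUs and create the device nodes the driver expects
  N3D=$(lspci | grep -i NVIDIA | grep -c "3D controller")
  NVGA=$(lspci | grep -i NVIDIA | grep -c "VGA compatible controller")
  N=$((N3D + NVGA - 1))
  for i in $(seq 0 $N); do
      [ -e /dev/nvidia$i ] || mknod -m 666 /dev/nvidia$i c 195 $i
  done
  [ -e /dev/nvidiactl ] || mknod -m 666 /dev/nvidiactl c 195 255
  # Optional, site-specific settings
  nvidia-smi -c EXCLUSIVE_PROCESS   # compute mode
  nvidia-smi -pm 1                  # persistence (or run nvidia-persistenced)
  # nvidia-smi -pl 225              # power cap in watts, if desired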
Mellanox OFED 2.1 (beta) has support for GPUDirect RDMA
— Should also be supported on Cray systems for CLE <…>
HW required: Mellanox FDR HCAs, Tesla K10/K20/K20X/K40
SW required: NVIDIA driver 331.20 or better, CUDA 5.5 or better, GPUDirect plugin from Mellanox
Enables an additional kernel driver, nv_peer_mem
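Once the Mellanox plugin is installed, the extra kernel module has to be resident on each node before jobs can use GPUDirect RDMA (a sketch; the init-script name depends on the plugin package):
  # Load the peer-memory module provided by the GPUDirect plugin
  modprobe nv_peer_mem
  # Confirm it is loaded
  lsmod | grep nv_peer_mem
  # Some packages install a service for this instead, e.g.:
  # service nv_peer_mem start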
By default, the driver unloads when the GPUs go idle:
— The driver must re-load when a job starts, slowing startup
— If ECC is on, memory is cleared between jobs
The persistence daemon keeps the driver loaded when GPUs are idle:
  # /usr/bin/nvidia-persistenced --persistence-mode \
      [--user <username>]
— Faster job startup time
— Slightly lower idle power
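On drivers without the daemon, the older per-GPU persistence flag gives a similar effect (noted as an alternative; nvidia-persistenced is the preferred mechanism):
  # Enable legacy persistence mode on all GPUs, e.g. from the init script
  nvidia-smi -pm 1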
ECC is enabled by default on Tesla GPUs:
— Correctable errors are logged but not scrubbed
— Uncorrectable errors cause an error at the user and system level
— The GPU rejects new work after an uncorrectable error, until reboot
ECC can be turned off, which makes more GPU memory available at the cost of error correction/detection:
— Configured using NVML or nvidia-smi:
  # nvidia-smi -e 0
— Requires a reboot to take effect
Power limits can be set with NVML/nvidia-smi
— Set on a per-GPU basis
— Useful in power-constrained environments
  # nvidia-smi -pl <power in watts>
— Settings don't persist across reboots: set this in your init script
— Requires driver persistence
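A sketch of capping a single GPU; the 225 W value is only an example, and the enforceable range for your board is reported by the power query:
  # Show default, current, and min/max enforceable power limits for GPU 0
  nvidia-smi -q -i 0 -d POWER
  # Cap GPU 0 at 225 W (must fall inside the reported range)
  nvidia-smi -i 0 -pl 225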
Cluster-level configuration:
— GPU integration in the resource manager and MPI
— Set up GPU process accounting to measure usage
— Configure GPU Boost clocks (or allow users to do so)
— Manage job topology on GPU compute nodes
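Process accounting is only mentioned in passing here; as a hedged sketch, recent drivers let you switch it on per GPU and then read back per-process usage through NVML or nvidia-smi's accounting queries (check nvidia-smi --help for the exact query options on your version):
  # Enable per-process accounting so the driver records usage statistics
  nvidia-smi -am 1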
GPU integration features are available for SLURM, Torque, PBS Pro, Univa Grid Engine, and LSF.
GPU status monitoring:
— Report current config, load sensor for utilization
Managing process topology:
— GPUs as consumables, assignment using CUDA_VISIBLE_DEVICES
— Set GPU configuration on a per-job basis
Health checks:
— Run nvidia-healthmon or integrate with your monitoring system
NVIDIA integration is usually configured at compile time (open source) or as a plugin.
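As one concrete example, a minimal sketch of exposing GPUs as consumable resources in SLURM; the file names and keywords are standard SLURM GRES configuration, but the node names and GPU counts here are made up:
  # slurm.conf (excerpt)
  GresTypes=gpu
  NodeName=node[01-16] Gres=gpu:4   # other node attributes omitted
  # gres.conf on each compute node
  Name=gpu File=/dev/nvidia0
  Name=gpu File=/dev/nvidia1
  Name=gpu File=/dev/nvidia2
  Name=gpu File=/dev/nvidia3
  # Users then request GPUs per job, e.g.:
  #   srun --gres=gpu:2 ./my_gpu_app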
Several MPI libraries support sending/receiving directly from CUDA device memory:
— OpenMPI 1.7+, MVAPICH2 1.8+, Platform MPI, Cray MPT
— Typically needs to be enabled when the MPI is compiled
— Depending on version and system topology, may also support GPUDirect RDMA
— Non-CUDA apps can use the same MPI without problems (but might link libcuda.so even if not needed)
Enable this in the MPI modules provided for users.
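A sketch of the compile-time switches, assuming source builds of OpenMPI and MVAPICH2; check each project's install guide for the exact options on your version:
  # OpenMPI 1.7+: build with CUDA support
  ./configure --with-cuda=/usr/local/cuda && make && make install
  # MVAPICH2 1.8+: build with CUDA support
  ./configure --enable-cuda --with-cuda=/usr/local/cuda && make && make install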
Use power headroom to run at higher clocks (GPU Boost).
[Chart: Tesla K40 with GPU Boost vs. Tesla K40 at base clocks; speedups of roughly 11% to 25% on AMBER SPFP-TRPCage, LAMMPS-EAM, and NAMD 2.9-APOA1]
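On Kepler-class Tesla boards, boost clocks are selected through application clocks (a sketch; the 3004 MHz memory / 875 MHz SM pair is one of the K40's supported combinations, so query the supported list first on your own boards):
  # List the clock combinations the GPU supports
  nvidia-smi -q -d SUPPORTED_CLOCKS
  # Set application clocks to <memory,graphics> in MHz, e.g. K40 boost
  nvidia-smi -ac 3004,875
  # Reset to the default application clocks
  nvidia-smi -rac
  # Recent drivers let you decide whether non-root users may change clocks:
  # nvidia-smi -acp UNRESTRICTED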
Compute mode controls how GPUs manage multiple CUDA contexts:
0/DEFAULT: accept simultaneous contexts.
1/EXCLUSIVE_THREAD: single context allowed, from a single thread.
2/PROHIBITED: no CUDA contexts allowed.
3/EXCLUSIVE_PROCESS: single context allowed, multiple threads OK. Most common setting in clusters.
Changing this setting requires root access, but it sometimes makes sense to make it user-configurable.
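Setting the mode from the command line (nvidia-smi accepts either the numeric value or the name; scheduler prolog/epilog scripts commonly reset it per job):
  # Put GPU 0 into exclusive-process mode
  nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
  # Equivalent numeric form
  nvidia-smi -i 0 -c 3
  # Back to the default shared mode
  nvidia-smi -i 0 -c DEFAULT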
MPS (Multi-Process Service) allows multiple processes to share a single CUDA context
— Improved performance where multiple processes share a GPU (vs. multiple open contexts)
— Easier porting of MPI apps: can continue to use one rank per CPU, but all ranks can access the GPU
Server process: nvidia-cuda-mps-server
Control daemon: nvidia-cuda-mps-control
[Diagram: application ranks 0-3 sharing one persistent context on GPU0 through MPS]
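A sketch of bringing MPS up on one GPU, typically from a job prolog; the exclusive-process setting and the pipe/log environment variables follow the commonly documented pattern, and the /tmp paths are placeholders:
  # Restrict MPS to GPU 0 and require a single (MPS) context on it
  export CUDA_VISIBLE_DEVICES=0
  nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
  # Directories used for the control pipe and logs (site-specific paths)
  mkdir -p /tmp/mps-pipe /tmp/mps-log
  export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps-pipe
  export CUDA_MPS_LOG_DIRECTORY=/tmp/mps-log
  # Start the control daemon; it launches nvidia-cuda-mps-server on demand
  nvidia-cuda-mps-control -d
  # ... run the job's ranks ...
  # Shut MPS down at job end
  echo quit | nvidia-cuda-mps-control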
Processes should be scheduled on cores "local" to the GPUs they use.
— No good "out of the box" tools for this yet!
— hwloc can help identify CPU <-> GPU locality
— Can use the PCIe device ID with NVML to get the CUDA rank
— Set process affinity with MPI or numactl
Possible admin actions:
— Documentation: node topology and how to set affinity
— Wrapper scripts using numactl to set "recommended" affinity
[Diagram: dual-socket node; GPU0 and GPU1 on CPU0's x16 PCIe links, GPU2 and GPU3 on CPU1's]
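A sketch using standard tools (lstopo ships with hwloc, numactl with every major distro; the binding values assume the two-socket layout in the diagram):
  # Show node topology, including which PCIe devices hang off which socket
  lstopo
  # Report each GPU's PCI bus ID so it can be matched against the topology
  nvidia-smi --query-gpu=index,name,pci.bus_id --format=csv
  # Run a rank that uses GPU2 or GPU3 on the cores and memory of socket 1
  numactl --cpunodebind=1 --membind=1 ./my_gpu_app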
CUDA_VISIBLE_DEVICES controls which GPUs are visible to a process
— Comma-separated list of devices:
  export CUDA_VISIBLE_DEVICES="0,2"
Tooling and resource manager support exists but is limited
— Example: configure SLURM with CPU <-> GPU mappings
— SLURM will use cgroups and CUDA_VISIBLE_DEVICES to assign resources
— Limited ability to manage process affinity this way
Where possible, assign all of a job's resources on the same PCIe root complex.
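A hedged sketch of a site-provided wrapper that pins each MPI rank to one GPU and the matching socket; OMPI_COMM_WORLD_LOCAL_RANK is OpenMPI-specific, and the GPU-to-socket mapping assumes the layout in the earlier diagram:
  #!/bin/bash
  # Launch as: mpirun -np 4 ./wrapper.sh ./my_gpu_app
  LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
  # One GPU per local rank
  export CUDA_VISIBLE_DEVICES=$LOCAL_RANK
  # GPUs 0-1 sit under socket 0, GPUs 2-3 under socket 1 (assumed layout)
  SOCKET=$(( LOCAL_RANK / 2 ))
  exec numactl --cpunodebind=$SOCKET --membind=$SOCKET "$@"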
nvidia-healthmon runs sanity checks against each GPU in the system:
— Basic sanity checks
— PCIe link configuration and bandwidth between host and peers
— GPU temperature
All checks are configurable: set them up based on your system's expected values.
Use your cluster health checker to run this for every job
— Single command to run all checks
— Returns 0 if successful, non-zero if a test fails
— Does not require root to run
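Because failure is signaled purely through the exit code, wiring it into a scheduler prolog takes only a few lines (a sketch; the log location and what to do on failure are site policy):
  #!/bin/bash
  # Job prolog fragment: refuse to start the job if the GPUs look unhealthy
  if ! nvidia-healthmon > /var/log/healthmon.$$ 2>&1; then
      echo "GPU health check failed on $(hostname); see /var/log/healthmon.$$" >&2
      exit 1   # non-zero exit tells the resource manager to hold or requeue the job
  fi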
GPU temperature: watch for hot spots
— Monitor with NVML, or out-of-band via the system BMC
GPU power usage
— Higher than expected power usage => possible HW issues
Current clock speeds
— Lower than expected => power capping or HW problems
— Check "Clocks Throttle Reasons" in nvidia-smi
ECC error counts
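All of these are visible through nvidia-smi, so a simple collector script can feed them to your monitoring system (a sketch; the -q sections are standard, and the CSV query form shown is available on reasonably recent drivers):
  # Human-readable dump of the interesting sections
  nvidia-smi -q -d TEMPERATURE,POWER,CLOCK,ECC
  # Machine-readable one line per GPU for a monitoring agent
  nvidia-smi --query-gpu=index,temperature.gpu,power.draw,clocks.sm,ecc.errors.corrected.volatile.total --format=csv,noheader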
Xid errors in the system log may indicate a HW error or a programming error
— Common non-HW causes: out-of-bounds memory access (13), illegal access (31), bad termination of program (45)
Turn on PCIe parity checking with EDAC:
  # modprobe edac_core
  # echo 1 > /sys/devices/system/edac/pci/check_pci_parity
— Monitor the value of /sys/devices/<pci-address>/broken_parity_status
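Xid events are written by the driver to the kernel log, so the monitoring side is a simple pattern match (a sketch; the log file path differs between distros):
  # Look for Xid events reported by the NVIDIA kernel module
  grep -i "NVRM: Xid" /var/log/messages
  # or, where kernel messages go elsewhere:
  dmesg | grep -i Xid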
The best burn-in test is a real user application
— Alternatively, use the CUDA Samples or benchmarks (like HPL)
Stress the entire system, not just the GPUs.
Do repeated runs in succession to stress the system.
Things to watch for:
— Inconsistent perf between nodes: config errors on some nodes
— Inconsistent perf between runs: cooling issues, check GPU temps
— Slow GPUs / PCIe transfers: misconfigured SBIOS, seating issues
Get "pilot" users with stressful workloads and monitor during their runs.
Use successful test data to set stricter bounds for monitoring and healthmon.
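A throwaway sketch of a repeated-run loop around the bandwidthTest sample from the CUDA toolkit, logging temperatures alongside so inconsistent runs stand out (the sample path and iteration count are placeholders):
  #!/bin/bash
  SAMPLE=/usr/local/cuda/samples/bin/x86_64/linux/release/bandwidthTest   # placeholder path
  for run in $(seq 1 20); do
      echo "=== run $run $(date) ==="
      nvidia-smi --query-gpu=index,temperature.gpu,clocks.sm --format=csv,noheader
      $SAMPLE --memory=pinned || echo "run $run FAILED"
  done | tee burnin.$(hostname).log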
There are several ways to enumerate GPUs:
— PCIe device order
— NVML index
— CUDA runtime device order
These may not be consistent with each other or between boots!
The serial number always maps to the physical board and is printed on the board.
The UUID always maps to the individual GPU. (I.e., 2 UUIDs and 1 SN if a board has 2 GPUs.)
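To build a stable mapping for inventory and trouble tickets, record bus ID, serial number, and UUID together (a sketch; the CSV fields shown exist on recent drivers, and nvidia-smi -L is available everywhere):
  # Index, PCI bus ID, board serial number, and per-GPU UUID, one line per GPU
  nvidia-smi --query-gpu=index,pci.bus_id,serial,uuid --format=csv
  # Short form: name + UUID per GPU
  nvidia-smi -L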
GPU topology and enumeration affect both system setup and job configuration
— You should provide tools which expose this to your users
Use NVML-enabled tools for GPU configuration and monitoring (or write your own!)
Lots of hooks exist for cluster integration and management, plus third-party tools.
For more information:
— Documentation in the GPU Deployment Kit
— man pages for the tools (nvidia-smi, nvidia-healthmon, etc.)
— Other talks in the "Clusters and GPU Management" tag here at GTC