GPU Sharing done right

In this presentation, Scott McAllister and I explain how to share GPUs across multiple Kubernetes clusters the right way.

This version of the talk was given at HashiConf San Francisco in September 2025.

Kerim Satirli

September 26, 2025
Transcript

  1. ©2025 HASHICORP. GPU sharing done right. Kerim Satirli, Senior Developer Advocate, HashiCorp; Scott McAllister, Principal Developer Advocate, vCluster.
  2. Multi-Instance GPUs (MIG): a single physical GPU is partitioned into GPU instances 0 through 4, each assigned to a different user.
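On Kubernetes, each MIG instance is advertised as its own schedulable resource by the NVIDIA device plugin. A minimal sketch of a pod requesting one slice (the resource name nvidia.com/mig-1g.5gb and the container image are illustrative assumptions; actual names depend on the GPU model and cluster setup):

```yaml
# Hypothetical pod requesting a single MIG slice. The resource name
# depends on the MIG profile exposed by the device plugin.
apiVersion: v1
kind: Pod
metadata:
  name: mig-user
spec:
  containers:
    - name: cuda-workload
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # assumed image
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1
```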
  3. Time-slicing (or: custom scheduling): processes 1 through 4 share the GPU's compute by taking turns in rotating time slices (T1 through T4).
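With the NVIDIA Kubernetes device plugin, time-slicing is enabled through the plugin's sharing configuration; a sketch, assuming four-way sharing of a single GPU (the replica count is an illustrative choice):

```yaml
# Sketch of a device-plugin time-slicing config: one physical GPU
# is advertised to the scheduler as 4 interleaved replicas.
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
```

Unlike MIG, time-slicing provides no memory or fault isolation between the processes sharing the GPU; they take turns on the same hardware.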
  4. Per-team GPU cluster: Tenant A and Tenant B each get their own tenant namespaces, GPU nodes, and KAI scheduler.
  6. Team-shared cluster: the virtual cluster runs its own API server, control manager, and data store, holding pod-1 and a custom-resource in a virtual-cluster namespace. The vcluster syncer recreates pod-1 as synced-pod-1 in a namespace of the underlying K8s cluster, whose control plane (etcd, control manager, scheduler, API server) also runs the vcluster-pod itself. The tenant connects to the virtual cluster's API server and controls only the virtual cluster; the admin connects to the K8s API server and controls the K8s context.
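The syncer's translation step can be sketched with two plain Kubernetes manifests (the names below are illustrative, not vcluster's actual naming scheme):

```yaml
# Inside the virtual cluster, the tenant creates an ordinary pod:
apiVersion: v1
kind: Pod
metadata:
  name: pod-1
  namespace: team-a            # exists only inside the virtual cluster
spec:
  containers:
    - name: app
      image: nginx             # placeholder workload
---
# The vcluster syncer re-creates it on the host cluster, with the
# name and namespace rewritten into the vcluster's host namespace:
apiVersion: v1
kind: Pod
metadata:
  name: synced-pod-1           # hypothetical translated name
  namespace: vcluster-tenant-a # assumed host namespace backing the vcluster
spec:
  containers:
    - name: app
      image: nginx
```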
  7. Team-shared cluster: the control plane (etcd, control manager, scheduler, API server) manages pod-1 and a custom-resource in a namespace, viewed from the team-cluster context.
  8. Multi-tenant cluster: a single Kubernetes cluster runs vcluster-a (Tenant A, tenant-a-proj-1) and vcluster-b (Tenant B, tenant-b-proj-1) alongside other operations, on top of four DGX nodes.
  9. Team-specific security: the control plane (etcd, control manager, scheduler) serves namespaces for Team 1, Team 2, and Team 3; deployments such as pod-1 pull their secrets from Vault.
  10. Team-shared cluster: deployments such as pod-1 retrieve credentials from Vault; Team 1 receives access for 4h, Team 2 receives access for S3.
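Time-bound grants like the 4h S3 access above map naturally onto Vault dynamic secrets. A sketch using the Terraform provider for Vault (the mount path, role name, and policy ARN are assumptions, not the speakers' actual setup):

```hcl
# Sketch: AWS secrets engine whose generated credentials expire after 4h.
resource "vault_aws_secret_backend" "aws" {
  path                      = "aws"   # assumed mount path
  default_lease_ttl_seconds = 14400   # 4 hours
  max_lease_ttl_seconds     = 14400
}

# Hypothetical role limiting Team 1 to read-only S3 access.
resource "vault_aws_secret_backend_role" "team_1_s3" {
  backend         = vault_aws_secret_backend.aws.path
  name            = "team-1-s3"
  credential_type = "iam_user"
  policy_arns     = ["arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"]
}
```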
  11. Configuration (vault-config.hcl):

      # ...
      listener "tcp" {
        address          = "127.0.0.1:8300"
        chroot_namespace = "team-1"
        tls_cert_file    = "/mnt/vault/chain.pem"
        tls_key_file     = "/mnt/vault/private-key.pem"
      }
      # ...