How Amazon EKS Came to Support 100,000 Nodes in a Single Cluster / Under the Hood EKS Ultra Scale Cluster

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Ryota Yamada, Global Automotive Solutions Architect
KUBERNETES MEETUP TOKYO #71

About me

Ryota Yamada (山田遼太)

Global Automotive Solutions Architect
• Works with automotive customers and provides technical support for realizing autonomous driving and SDV.

TFC Containers Japan Lead
• As a container services specialist, supports container application development from many angles, including DevOps, Platform Engineering, Observability, and Resiliency.

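Supporting autonomous driving model development

Please also take a look at my past presentations.

Data Centric AI → model development centered on data. Using Amazon EKS + Ray is common.
• Some argue that 100 billion miles of driving data are needed for training.
• https://arxiv.org/pdf/2311.16038
• 100 billion miles is estimated to amount to several hundred exabytes.

Autonomous driving model development case studies on AWS
https://aws.amazon.com/jp/automotive/autonomous-mobility/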

About this session

Prerequisites
• Basic knowledge of the core components of a Kubernetes cluster and an understanding of how they behave: kube-apiserver, etcd, bbolt, kube-scheduler, built-in controllers, CNI, CoreDNS, kube-proxy, reconciliation loop, Admission Webhook
• Basic database techniques such as Raft and MVCC

Goal of today's talk
• How Amazon EKS made very large clusters possible.

Amazon EKS Ultra Scale Cluster

Using EKS to train large language models

Developers of frontier foundation models, including
• Anthropic (Claude)
• Amazon AGI (Nova)
have adopted Amazon EKS and Ultra Scale Cluster.

https://aws.amazon.com/jp/blogs/containers/amazon-eks-enables-ultra-scale-ai-ml-workloads-with-support-for-100k-nodes-per-cluster/

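Amazon EKS Ultra Scale Cluster

• Supports 100,000 nodes in a single cluster.
• This corresponds to harnessing up to 1.6 million AWS Trainium chips on Trn2 instances, or 800,000 NVIDIA H200/Blackwell GPUs on P5e/P6 instances.
• The intended uses are large-scale distributed training, training of large foundation models, fine-tuning, and similar workloads.
• EKS applied a series of optimizations for 100,000-node clusters on the premise that they are built specifically for AI/ML training and inference workloads. These optimizations do not target general-purpose or non-AI/ML workloads.
• You cannot onboard to a 100,000-node cluster in a self-service manner through the EKS console, API, or CLI.
• If you are considering a very large cluster on EKS, please contact AWS.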
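The 1.6 million and 800,000 figures are consistent with 100,000 nodes under one assumption about per-node accelerator counts. The sketch below just does that multiplication. The per-node counts (16 Trainium chips per Trn2 node, 8 GPUs per P5e/P6 node) are assumptions that do not appear on the slide, and acceleratorsPerCluster is a name made up for this illustration.

```go
package main

import "fmt"

// acceleratorsPerCluster multiplies the node count by the number of
// accelerators attached to each node.
func acceleratorsPerCluster(nodes, perNode int) int {
	return nodes * perNode
}

func main() {
	const nodes = 100_000

	// Assumed per-node counts (not stated on the slide):
	// 16 Trainium chips per Trn2 node, 8 GPUs per P5e/P6 node.
	fmt.Println("Trainium chips:", acceleratorsPerCluster(nodes, 16))
	fmt.Println("NVIDIA GPUs:", acceleratorsPerCluster(nodes, 8))
}
```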

If you grow a Kubernetes cluster one node at a time, which parts surface as bottlenecks?

Challenges that appear as the worker node count grows

1. etcd
2. API Server
3. Built-in Controller / kube-scheduler (excessive object count)
4. Admission Webhook
5. Run out / failure of Worker Node
6. Cluster Network (CoreDNS / kube-proxy / iptables / NAU)
7. Telemetry Collection (Fluent Bit / OTel Exporter / Prometheus)
8. Container Image Distribution
9. Supported Number of Nodes and evaluation

How etcd scalability was improved

① Moved distributed consensus from Raft to Journal
② Moved BoltDB storage from EBS to in-memory
③ Sharded etcd by resource type

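① Moving distributed consensus from Raft to Journal

How etcd persists transactions today

etcd uses Raft to maintain a consistent, replicated transaction log across all cluster members.

• The Raft library is a pure implementation of the Raft state transitions (Follower/Candidate/Leader/Learner), log replication, and leader election. It performs no application-specific I/O; instead it tells the caller what it needs done through the Ready struct.
• https://github.com/etcd-io/raft/blob/main/node.go
• The application processes each Ready received from Node.Ready(), executing these steps in the correct order: ① append to the WAL and fsync, ② save the snapshot, ③ send RPCs to each peer, ④ apply the committed entries.
• https://github.com/etcd-io/etcd/blob/main/contrib/raftexample/raft.go
• A quorum of cluster members persists the Raft WAL (hard state and log) and the snapshot (compaction point) to disk. On restart, a member replays the snapshot plus the WAL to restore both the Raft state and the application state.
• https://etcd.io/docs/v3.6/learning/persistent-storage-files/
• Once committed, entries are applied to etcd's MVCC backend (bbolt) and the revision advances. Watches are tied to this MVCC sequence number.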
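The ordering of the four Ready-handling steps can be sketched with a toy model. This is not the etcd-io/raft API: the Ready, Entry, and Server types below are simplified stand-ins invented for illustration, and persistence and networking are reduced to in-memory slices.

```go
package main

import "fmt"

// Entry is one replicated log record (simplified).
type Entry struct {
	Index uint64
	Data  string
}

// Ready is a toy stand-in for the batch of work the Raft core hands to the
// application. The field names echo the idea, not the real library type.
type Ready struct {
	Entries          []Entry  // new entries that must reach stable storage
	Snapshot         string   // non-empty when a snapshot must be saved
	Messages         []string // outbound messages for peers
	CommittedEntries []Entry  // entries that are safe to apply
}

// Server holds the application-side state that the Raft core never touches.
type Server struct {
	wal       []Entry  // stands in for the fsynced write-ahead log
	snapshots []string // stands in for snapshot files
	sent      []string // stands in for the network transport
	applied   []string // stands in for the bbolt-backed MVCC store
	revision  int64
}

// handleReady performs the side effects in the order the slide lists them.
func (s *Server) handleReady(rd Ready) {
	s.wal = append(s.wal, rd.Entries...) // step 1: append to WAL (fsync omitted)
	if rd.Snapshot != "" {
		s.snapshots = append(s.snapshots, rd.Snapshot) // step 2: save snapshot
	}
	s.sent = append(s.sent, rd.Messages...) // step 3: send to peers
	for _, e := range rd.CommittedEntries { // step 4: apply committed entries
		s.applied = append(s.applied, e.Data)
		s.revision++
	}
}

func main() {
	s := &Server{}
	s.handleReady(Ready{
		Entries:          []Entry{{Index: 1, Data: "put a=1"}, {Index: 2, Data: "put b=2"}},
		Messages:         []string{"append to peer 2", "append to peer 3"},
		CommittedEntries: []Entry{{Index: 1, Data: "put a=1"}},
	})
	fmt.Println("wal entries:", len(s.wal), "sent:", len(s.sent), "revision:", s.revision)
}
```

The point of the model is the division of labor: the consensus core decides what must happen, while the surrounding server owns every disk write and network send.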

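① Moving distributed consensus from Raft to Journal

What is Journal?

Journal, by contrast, is a replicated log service that has been used inside Amazon for more than 10 years and underpins S3, DynamoDB, Kinesis, Lambda, and other services.

[Figure: Journal provides durability and atomicity for the Transaction Log through Ordered Data Replication]

https://ajalab.github.io/posts/2025-08-14-journal-distributed-log-replication-behind-aws/
https://youtu.be/huGmR_mi5dQ

Because Journal guarantees the durability of the transaction log, etcd replicas can be scaled freely without being bound by quorum requirements, and the need for peer-to-peer communication was eliminated.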
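Why an external ordered log removes the quorum and peer-to-peer requirements can be illustrated with a toy model. Journal's real interface is not public in this deck, so the Journal and Replica types and their methods below are hypothetical, and durability is simply assumed rather than implemented.

```go
package main

import "fmt"

// Journal is a toy append-only ordered log standing in for the external
// replicated log service. Its durability is assumed, not implemented.
type Journal struct {
	records []string
}

// Append stores a record and returns its 1-based sequence number.
func (j *Journal) Append(rec string) int {
	j.records = append(j.records, rec)
	return len(j.records)
}

// ReadFrom returns every record whose sequence number is greater than seq.
func (j *Journal) ReadFrom(seq int) []string {
	return j.records[seq:]
}

// Replica applies journal records in order and remembers its position.
type Replica struct {
	applied int
	state   []string
}

// CatchUp pulls whatever the replica has not applied yet. Replicas never
// talk to each other; they only read the journal.
func (r *Replica) CatchUp(j *Journal) {
	for _, rec := range j.ReadFrom(r.applied) {
		r.state = append(r.state, rec)
		r.applied++
	}
}

func main() {
	j := &Journal{}
	j.Append("put a=1")
	j.Append("put b=2")

	r1 := &Replica{}
	r1.CatchUp(j)

	j.Append("put c=3")

	// A replica added later simply replays from the start of the log.
	r2 := &Replica{}
	r1.CatchUp(j)
	r2.CatchUp(j)

	fmt.Println("r1 applied:", r1.applied, "r2 applied:", r2.applied)
}
```

In this model adding or removing a replica changes nothing about how a write becomes durable, which is the property the slide attributes to Journal.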

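② Moving BoltDB storage from EBS to in-memory

The BoltDB data location was moved from EBS to tmpfs (in-memory storage).

1. The current state can be restored by replaying the transaction log from a past snapshot point.
2. In other words, the durability of the transaction log is the durability of the database.
3. Journal guarantees the durability of the transaction log.
4. The database therefore stays durable even if the state applied after commit is not stored durably, so using in-memory storage instead of EBS is not a problem.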
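The first step of this argument, rebuilding state from a snapshot plus the log entries after it, can be sketched as follows. The Op and Snapshot types and the restore function are hypothetical names for illustration; real etcd state is an MVCC store, not a flat map.

```go
package main

import "fmt"

// Op is one committed write in the transaction log.
type Op struct {
	Seq   int
	Key   string
	Value string
}

// Snapshot is a copy of the state as of log position Seq.
type Snapshot struct {
	Seq  int
	Data map[string]string
}

// restore rebuilds in-memory state from a snapshot plus the log entries that
// follow it, and returns the state with the last applied sequence number.
func restore(snap Snapshot, log []Op) (map[string]string, int) {
	state := make(map[string]string, len(snap.Data))
	for k, v := range snap.Data {
		state[k] = v
	}
	seq := snap.Seq
	for _, op := range log {
		if op.Seq <= seq {
			continue // already reflected in the snapshot
		}
		state[op.Key] = op.Value
		seq = op.Seq
	}
	return state, seq
}

func main() {
	log := []Op{
		{Seq: 1, Key: "a", Value: "1"},
		{Seq: 2, Key: "b", Value: "2"},
		{Seq: 3, Key: "a", Value: "3"},
		{Seq: 4, Key: "c", Value: "4"},
	}
	snap := Snapshot{Seq: 2, Data: map[string]string{"a": "1", "b": "2"}}

	// Simulates a restart with an empty tmpfs: nothing local survives,
	// yet the state is fully recoverable from the snapshot and the log.
	state, seq := restore(snap, log)
	fmt.Println(state, "restored up to seq", seq)
}
```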

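③ Sharding etcd by resource type

Hot resource types with especially large object counts are split out into separate etcd clusters.

[Figure: a single etcd split into several etcd clusters, labeled "e.g. Node", "e.g. Pod", "e.g. RBAC", "e.g. Endpoint", and "Others"]

Note: the resource names assigned in this figure are examples only and are unrelated to the partitions actually used.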
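Routing by resource type can be sketched as a lookup table with a default. The ShardRouter type and the endpoint names below are hypothetical, and, like the figure, the resource names are examples only. Upstream kube-apiserver exposes a similar idea through its --etcd-servers-overrides flag; the deck does not say whether EKS uses that flag or a different mechanism.

```go
package main

import "fmt"

// ShardRouter maps a resource type to the etcd cluster that stores it.
type ShardRouter struct {
	overrides map[string]string // resource type -> etcd endpoint
	fallback  string            // endpoint for every other resource type
}

// EndpointFor returns the etcd endpoint responsible for the resource type.
func (r ShardRouter) EndpointFor(resource string) string {
	if ep, ok := r.overrides[resource]; ok {
		return ep
	}
	return r.fallback
}

func main() {
	// Hypothetical endpoints and example resource names only.
	router := ShardRouter{
		overrides: map[string]string{
			"nodes": "https://etcd-nodes.example:2379",
			"pods":  "https://etcd-pods.example:2379",
		},
		fallback: "https://etcd-main.example:2379",
	}

	for _, res := range []string{"pods", "nodes", "configmaps"} {
		fmt.Println(res, "->", router.EndpointFor(res))
	}
}
```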

An API server with extreme throughput

• Tuning of the API server and webhooks
• Consistent reads from cache
• KEP-2340: Consistent Read from Cache
• Efficient reads of large collections
• KEP-5116: Streaming Encoding for LIST Responses
• Binary encoding for custom resources
• KEP-4222: CBOR Serializer

etcd progress notifications

A mechanism that has the server send, on an etcd Watch stream, an empty event (carrying only the revision in the header) that says "this stream has now been delivered up to revision R". This lets the consumer determine how up to date its cache is.

https://etcd.io/docs/v3.5/dev-guide/interacting_v3/#watch-progress
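How a consumer might use such notifications can be modeled in a few lines. The WatchResponse and Cache types below are simplified stand-ins, not the etcd client types, and the real client checks more conditions before treating a response as a progress notification.

```go
package main

import "fmt"

// WatchResponse is a toy model of one message on a watch stream.
// A progress notification carries a header revision and no events.
type WatchResponse struct {
	HeaderRevision int64
	Events         []string
}

// IsProgressNotify reports whether the response only advances the revision.
func (w WatchResponse) IsProgressNotify() bool {
	return len(w.Events) == 0
}

// Cache tracks the newest revision it is known to reflect.
type Cache struct {
	items    []string
	revision int64
}

// Observe applies any events and records how far the stream has progressed,
// whether or not the response carried events.
func (c *Cache) Observe(w WatchResponse) {
	c.items = append(c.items, w.Events...)
	if w.HeaderRevision > c.revision {
		c.revision = w.HeaderRevision
	}
}

func main() {
	c := &Cache{}
	c.Observe(WatchResponse{HeaderRevision: 10, Events: []string{"pod-a created"}})

	// Writes to keys this watcher does not follow moved the store to
	// revision 15; without a notification the cache would still claim 10.
	notify := WatchResponse{HeaderRevision: 15}
	c.Observe(notify)

	fmt.Println("progress notify:", notify.IsProgressNotify(), "cache revision:", c.revision)
}
```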

resourceVersion in the Kube API Server

A resource version is a value that identifies the version of an object.

• v1.meta/ObjectMeta: the metadata.resourceVersion of a resource instance identifies the resource version at which that instance was last modified.
• v1.meta/ListMeta: the metadata.resourceVersion of a resource collection (a list response) identifies the resource version at which that collection was constructed.

Behavior on Get
• resourceVersion unset → Most Recent
• resourceVersion="0" → Any
• resourceVersion=any other value → Not older than (the given value or newer)

https://kubernetes.io/docs/reference/using-api/api-concepts/

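Consistent Reads from Cache: KEP-2340

Problem
• A GET/LIST without resourceVersion requires an etcd quorum read (strong consistency).
• With a label or field selector, every object is fetched from etcd, filtered on the API server side, and the rest is left for the GC. The kubelet query "show me only the Pods on my node" is a notable case of this.

Solution (KEP-2340)
• The API server's watch cache learns which etcd revision it has caught up to through progress notifications (at 100 ms intervals).
• The request waits (only when necessary) until the consistency requirement is met; once the cache is fresh enough, the response is served from the cache.

https://github.com/kubernetes/enhancements/issues/2340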
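The wait-until-fresh idea can be sketched with a toy store and cache. This is a simplified model rather than the kube-apiserver implementation: Store, WatchCache, consistentList, and the maxWaits parameter are hypothetical, and a synchronous catchUp call stands in for waiting on the next progress notification.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Store is a toy etcd: it knows its current revision and holds the data.
type Store struct {
	revision int64
	pods     map[string]string // pod name -> node name
}

// WatchCache is a toy API server cache that may lag behind the store.
type WatchCache struct {
	revision int64
	pods     map[string]string
}

// catchUp copies the store into the cache, standing in for watch events
// plus a progress notification arriving on the watch stream.
func (c *WatchCache) catchUp(s *Store) {
	c.pods = make(map[string]string, len(s.pods))
	for k, v := range s.pods {
		c.pods[k] = v
	}
	c.revision = s.revision
}

// consistentList serves a per-node query from the cache once the cache has
// reached the revision the store reported when the request arrived.
func consistentList(s *Store, c *WatchCache, node string, maxWaits int) ([]string, error) {
	required := s.revision // only the revision is read from the store
	for i := 0; c.revision < required; i++ {
		if i >= maxWaits {
			return nil, errors.New("cache did not catch up in time")
		}
		c.catchUp(s) // in a real server: block until a newer revision is observed
	}
	var names []string
	for name, n := range c.pods {
		if n == node {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names, nil
}

func main() {
	s := &Store{revision: 5, pods: map[string]string{"a": "node-1", "b": "node-2", "c": "node-1"}}
	c := &WatchCache{revision: 3, pods: map[string]string{"a": "node-1"}}

	names, err := consistentList(s, c, "node-1", 1)
	fmt.Println(names, err)
}
```

The filtering happens against the cache, so the store is only asked for its latest revision instead of returning every Pod.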

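KEP-5116: Streaming Encoding for LIST Responses

Problem
• When the Kube API Server's List API returned a response, it had to serialize the entire response into a single buffer and then write it out with one ResponseWriter.Write call.
• With a very large number of objects, this increased memory pressure.

Solution (KEP-5116)
• Instead of encoding the elements of the Items array all at once, process them one by one, sending each as it is encoded and freeing memory along the way.

https://ahmet.im/blog/kubernetes-list-performance
https://github.com/kubernetes/enhancements/issues/5116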
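The per-item approach can be illustrated with the standard library alone. This is a minimal sketch of the technique, not the kube-apiserver encoder: Item, writeListStreaming, and encodeToString are names invented here, and a bytes.Buffer stands in for the HTTP response writer.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
)

// Item is a stand-in for one object in a list response.
type Item struct {
	Name string `json:"name"`
}

// writeListStreaming writes {"items":[...]} one element at a time, so only a
// single encoded item is held in memory at any moment.
func writeListStreaming(w io.Writer, items []Item) error {
	if _, err := io.WriteString(w, `{"items":[`); err != nil {
		return err
	}
	for i, it := range items {
		if i > 0 {
			if _, err := io.WriteString(w, ","); err != nil {
				return err
			}
		}
		b, err := json.Marshal(it) // encode just this element
		if err != nil {
			return err
		}
		if _, err := w.Write(b); err != nil {
			return err
		}
	}
	_, err := io.WriteString(w, "]}")
	return err
}

// encodeToString runs the streaming encoder against an in-memory buffer.
func encodeToString(items []Item) (string, error) {
	var buf bytes.Buffer
	if err := writeListStreaming(&buf, items); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	items := []Item{{Name: "pod-a"}, {Name: "pod-b"}}

	streamed, err := encodeToString(items)
	if err != nil {
		panic(err)
	}

	// Baseline: serialize the whole response into one buffer.
	whole, err := json.Marshal(struct {
		Items []Item `json:"items"`
	}{Items: items})
	if err != nil {
		panic(err)
	}

	fmt.Println("same bytes:", streamed == string(whole))
}
```

Both paths produce the same bytes for a non-empty list; the difference is that the streaming path never needs a buffer the size of the full response.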


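Informer Cache

Problem
RWMutex contention occurs in the Indexer of the Informers shared inside kube-controller-manager. Large List calls from controllers hold the RLock for a long time, so the DeltaFIFO → Informer / Indexer updates stall, event processing is delayed, and the cache goes stale.
https://github.com/kubernetes/kubernetes/issues/130767

Cause
1. When the CacheController pops from the DeltaFIFO and updates the Store in processDeltas, it needs the Indexer's write lock.
2. Meanwhile, several controllers in kube-controller-manager (StatefulSet, DaemonSet, and others) fetch large numbers of objects from the Informer's Store with List and hold the RLock for a long time.
3. During that time the write lock cannot be acquired, so the DeltaFIFO queue backs up → event processing is delayed → the cache becomes stale → controllers misbehave (for example, deciding that a Pod is NotRunning although it is already Running, and retrying its creation).

Mitigation
Add an IndexFunc that matches the workload's access pattern, such as "Pods in a specific Namespace that match a label".
Ref: https://github.com/kubernetes/kubernetes/pull/132396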
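The effect of an index on read-lock hold time can be shown with a toy store. This is not client-go: the Store, Pod, and fill names are made up for illustration, and the number of objects visited while holding the RLock serves as a rough proxy for how long the lock is held.

```go
package main

import (
	"fmt"
	"sync"
)

// Pod is a minimal stand-in for a cached object.
type Pod struct {
	Namespace, Name, Team string
}

// Store is a toy informer store guarded by one RWMutex, with a secondary
// index from "namespace/team" to pod names.
type Store struct {
	mu    sync.RWMutex
	pods  map[string]Pod
	index map[string][]string
}

func NewStore() *Store {
	return &Store{pods: map[string]Pod{}, index: map[string][]string{}}
}

// Add takes the write lock, as an informer does when it processes a delta.
// (Toy simplification: re-adding the same pod would duplicate its index entry.)
func (s *Store) Add(p Pod) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.pods[p.Namespace+"/"+p.Name] = p
	key := p.Namespace + "/" + p.Team
	s.index[key] = append(s.index[key], p.Name)
}

// ScanByTeam holds the read lock while visiting every pod in the store.
func (s *Store) ScanByTeam(ns, team string) (names []string, visited int) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	for _, p := range s.pods {
		visited++
		if p.Namespace == ns && p.Team == team {
			names = append(names, p.Name)
		}
	}
	return names, visited
}

// ByIndex holds the read lock only for one map lookup and a small copy.
func (s *Store) ByIndex(ns, team string) []string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return append([]string(nil), s.index[ns+"/"+team]...)
}

// fill adds n unrelated pods plus three pods owned by the "network" team.
func fill(s *Store, n int) {
	for i := 0; i < n; i++ {
		s.Add(Pod{Namespace: "default", Name: fmt.Sprintf("web-%d", i), Team: "web"})
	}
	for i := 0; i < 3; i++ {
		s.Add(Pod{Namespace: "kube-system", Name: fmt.Sprintf("cni-%d", i), Team: "network"})
	}
}

func main() {
	s := NewStore()
	fill(s, 10000)

	scanned, visited := s.ScanByTeam("kube-system", "network")
	indexed := s.ByIndex("kube-system", "network")
	fmt.Println("scan matched", len(scanned), "after visiting", visited, "pods under RLock")
	fmt.Println("index matched", len(indexed), "with a single lookup under RLock")
}
```

Writers waiting in Add are blocked for as long as any reader holds the RLock, so shrinking each read from a full scan to a single lookup is what relieves the DeltaFIFO backlog in this model.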

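Appendix. Example of a custom index

    // Assumes these imports: "fmt", v1 "k8s.io/api/core/v1", "k8s.io/client-go/tools/cache"
    // Index Pods by "<namespace>/<value of the team label>".
    err := podInformer.Informer().AddIndexers(cache.Indexers{
        "nsLabel:team": func(obj interface{}) ([]string, error) {
            p, ok := obj.(*v1.Pod)
            if !ok {
                return nil, fmt.Errorf("unexpected object type %T", obj)
            }
            team := p.Labels["team"]
            if team == "" {
                return nil, nil // Pods without a team label are not indexed
            }
            // Use the Pod's own namespace: kubernetes.io/metadata.name is a
            // label on Namespace objects and is not present on Pods.
            return []string{p.Namespace + "/" + team}, nil
        },
    })
    if err != nil {
        return err
    }

    // Look up the Pods of team "network" in kube-system with one index read.
    pods, err := podInformer.Informer().GetIndexer().ByIndex("nsLabel:team", "kube-system/network")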

https://aws.amazon.com/jp/blogs/news/under-the-hood-amazon-eks-ultra-scale-clusters/

Thank you!
