Introduction to GKE x LLM

Introduction to GKE x LLM @soma00333 Kubernetes Novice Tokyo #31
(2024/04/09)

2 About the speaker
• @soma00333
• CTO @ Industry Technology Inc.
• SRE @ enechain Corporation
• X: @soma00333
• GitHub: @soma00333

4 Introduction: Announcement
• The LLM world changes quickly
  ◦ Today is 2024/04/09
• Opinions are my own

5 Introduction: Message
• There are cases where hosting an LLM in-house is worthwhile
• Kubernetes is an effective platform for hosting an LLM

6 Introduction: Table of contents
• What is LLM
• Cases for hosting an LLM in-house
• Use GPU on Kubernetes
• Deploy LLM on GKE

7 What is LLM
• Overview
• What is Hugging Face
• What is TGI

8 What is LLM: Overview
• Short for Large Language Model
• A deep learning model based on the Transformer
• The Transformer was proposed in "Attention Is All You Need" (Google, 2017)
• The trend is toward larger and multimodal models
• e.g.
  ◦ OpenAI: GPT-4
  ◦ Google: Gemini, Gemma
  ◦ Anthropic: Claude 3
https://arxiv.org/abs/1706.03762
https://speakerdeck.com/pfn/llmnoxian-zai

9 What is LLM: What is Hugging Face
• A platform provided by Hugging Face, Inc.
• Plays a GitHub-like role for the ML field
• Pretrained models and training datasets can be published and downloaded
https://huggingface.co/google/gemma-7b

10 What is LLM: What is TGI
• Text Generation Inference (TGI)
• A library for LLMs developed by Hugging Face
• Used in inference servers
• Built with Rust, Python, and gRPC
• Addresses response-time and latency problems under concurrent requests
  ◦ cf. vLLM
https://github.com/huggingface/text-generation-inference
https://github.com/vllm-project/vllm

11 Cases for hosting an LLM in-house
• LLM as a product feature
• Example application architecture
• In-house hosting case 1
• In-house hosting case 2
• Kubernetes as the hosting platform

12 Cases for hosting an LLM in-house: LLM as a product feature
• There is a trend of building LLMs into products as a feature
• e.g.
  ◦ Notion AI
  ◦ Gemini for xxx
    ▪ Workspace
    ▪ Google Cloud
  ◦ Microsoft Copilot
  ◦ GitHub Copilot
  ◦ Slack AI
https://slack.com/intl/ja-jp/blog/news/slack-ai-has-arrived

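13 Cases for hosting an LLM in-house: Example application architecture
• Implementation example: Algomatic's シゴラクAI (Shigoraku AI), a ChatGPT-style application for enterprises
• Uses the Azure OpenAI and OpenAI APIs
• Builds an LLM Gateway so that the underlying LLM can be swapped out
• → Avoids the risk of depending on one specific LLM
• → Makes it easier to adopt new models and new features
  ◦ e.g. Claude 3
https://speakerdeck.com/tkikuchi1002/llm-engineering-architecture?slide=33
https://shigoraku.ai/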
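The LLM Gateway idea can be sketched in a few lines of Python. This is only an illustration of the pattern, not Algomatic's implementation: the `LLMBackend` interface, the `EchoBackend` stand-in, and the backend names are all hypothetical, and a real backend would call a provider SDK or a self-hosted inference server.

```python
from typing import Dict, Optional, Protocol


class LLMBackend(Protocol):
    """Hypothetical interface: anything that turns a prompt into a completion."""

    def generate(self, prompt: str) -> str: ...


class EchoBackend:
    """Stand-in backend that only echoes; a real one would call an LLM API."""

    def __init__(self, label: str) -> None:
        self.label = label

    def generate(self, prompt: str) -> str:
        return f"[{self.label}] {prompt}"


class LLMGateway:
    """Routes each request to a named backend so callers never depend on one provider."""

    def __init__(self, default: str) -> None:
        self._backends: Dict[str, LLMBackend] = {}
        self._default = default

    def register(self, name: str, backend: LLMBackend) -> None:
        self._backends[name] = backend

    def generate(self, prompt: str, backend: Optional[str] = None) -> str:
        name = backend or self._default
        if name not in self._backends:
            raise KeyError(f"unknown backend: {name}")
        return self._backends[name].generate(prompt)


# Backend names are illustrative only.
gateway = LLMGateway(default="azure-openai")
gateway.register("azure-openai", EchoBackend("azure-openai"))
gateway.register("self-hosted", EchoBackend("self-hosted"))

print(gateway.generate("Summarize this ticket"))
print(gateway.generate("Summarize this ticket", backend="self-hosted"))
```

Because callers only see `gateway.generate`, adding a new model amounts to registering one more backend.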

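14 Cases for hosting an LLM in-house: Case 1 (in-house LLM only)
• Data privacy concerns
• Copyright concerns
  ◦ Training data
  ◦ Generated output
  ◦ cf. AI時代の知的財産権検討会（第１回） (Study Group on Intellectual Property Rights in the AI Era, 1st meeting)
• The company or its customers have strict requirements
  ◦ Do not want to store information outside Japan
  ◦ Do not want to send confidential data to external parties
• cf.
  ◦ NEC: support for writing electronic medical records and medical documents, using on-premises infrastructure x an in-house LLM
https://www.kantei.go.jp/jp/singi/titeki2/ai_kentoukai/gijisidai/index.html
https://jpn.nec.com/rd/technologies/202313/index.html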

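15 Cases for hosting an LLM in-house: Case 2 (combined use of multiple LLMs)
• Working around API limits
  ◦ Context length limits
  ◦ Rate limits
• Improving accuracy on specific tasks
  ◦ e.g. Japanese-language performance
• Using a smaller model → better cost and performance
• cf.
  ◦ A proof-of-concept project that combines OpenAI's GPT-family models with ELYZA's in-house model
https://prtimes.jp/main/html/rd/p/000000026.000047565.html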

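16 Cases for hosting an LLM in-house: Kubernetes as the hosting platform
• Manages large numbers of containers together
• Easy scaling
• Easy load balancing
• Governance
• Observability
• …
• The application and the inference server can share the same platform
• GPUs are available
• Mechanisms for using GPUs more efficiently are under discussion
  ◦ cf. Dynamic resource allocation
https://speakerdeck.com/bells17/kep-3063-dynamic-resource-allocation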

17 Use GPU on Kubernetes
• Overview
• What is Device Plugin
• How to use GPU
• Use GPU on GKE Standard
• Use GPU on GKE Autopilot

18 Use GPU on Kubernetes: Overview
• Process
  ◦ 1. Install the GPU vendor's (e.g. AMD, NVIDIA) GPU driver on the node
  ◦ 2. Deploy the GPU vendor's device plugin
  ◦ 3. Pods can then access the vendor's GPUs
• GPUs are available on both GKE Autopilot and GKE Standard
https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/

19 Use GPU on Kubernetes: What is Device Plugin
• Kubernetes provides the device plugin framework
• Vendors implement their device plugins on top of this framework
  ◦ e.g. AMD, NVIDIA
• Process
  ◦ 1. The scheduler finds a suitable node
  ◦ 2. The kubelet calls the device plugin manager to obtain a device ID and sends it to the device plugin
  ◦ 3. The device plugin accesses the device driver, obtains the device path, driver directory, and environment variables, and returns them to the kubelet
  ◦ 4. The kubelet assigns the device to the container runtime and starts the container
https://intel.github.io/kubernetes-docs/device-plugins/index.html
https://github.com/ROCm/k8s-device-plugin
https://github.com/NVIDIA/k8s-device-plugin
https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/

20 Use GPU on Kubernetes: How to use GPU
• Install the NVIDIA GPU driver on the node
• Deploy the NVIDIA device plugin to the node
• Specify nvidia.com/gpu under resources
https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/
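As a minimal sketch of the last bullet, the Python snippet below builds a Pod manifest that requests one GPU through `nvidia.com/gpu` and prints it as JSON. The Pod name and the image `registry.example/gpu-app:latest` are placeholders, not taken from the talk.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpu_count: int = 1) -> dict:
    """Build a Pod manifest whose single container requests NVIDIA GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "OnFailure",
            "containers": [
                {
                    "name": name,
                    "image": image,  # placeholder image, not from the talk
                    # GPUs are requested under limits as an extended resource.
                    "resources": {"limits": {"nvidia.com/gpu": gpu_count}},
                }
            ],
        },
    }


manifest = gpu_pod_manifest("gpu-example", "registry.example/gpu-app:latest")
print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON as well as YAML, the printed output can be saved to a file and applied with `kubectl apply -f`.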

21 Use GPU on Kubernetes: Use GPU on GKE Standard
• Standard requires you to create a node pool for GPUs
• Create a GPU node pool
  ◦ Instruct GKE to install the GPU driver automatically
https://cloud.google.com/kubernetes-engine/docs/how-to/gpus
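The node pool step can be sketched by assembling the corresponding `gcloud` command in Python. The pool name, region, machine type, and GPU type below are assumptions chosen for illustration; `gpu-driver-version=latest` in the `--accelerator` flag is what asks GKE to install the driver automatically. Check the GKE documentation linked above for the values valid in your project.

```python
import shlex
from typing import List


def gpu_node_pool_command(
    pool: str,
    cluster: str,
    region: str,
    gpu_type: str,
    gpu_count: int = 1,
    machine_type: str = "g2-standard-8",  # assumed machine type for an L4 GPU
    driver_version: str = "latest",  # asks GKE to install the GPU driver
) -> List[str]:
    """Return the argv for creating a GPU node pool on GKE Standard."""
    accelerator = (
        f"type={gpu_type},count={gpu_count},gpu-driver-version={driver_version}"
    )
    return [
        "gcloud", "container", "node-pools", "create", pool,
        "--cluster", cluster,
        "--region", region,
        "--machine-type", machine_type,
        "--accelerator", accelerator,
    ]


# All argument values here are illustrative assumptions.
argv = gpu_node_pool_command("gpu-pool", "llm-cluster", "us-central1", "nvidia-l4")
print(shlex.join(argv))
```

Building the command as a list keeps each argument separate, so it can be passed to `subprocess.run` without shell quoting issues.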

22 Use GPU on Kubernetes: Use GPU on GKE Standard
• Deploy the Pod
  ◦ Specify the GPU under resources
https://cloud.google.com/kubernetes-engine/docs/how-to/gpus

23 Use GPU on Kubernetes: Use GPU on GKE Autopilot
• Deploy the Pod
  ◦ No need to create a GPU node pool
  ◦ Specify the GPU type with nodeSelector
  ◦ Specify the GPU under resources
  ◦ Uses node auto-provisioning internally
• Autopilot restricts which GPU types and features you can use
  ◦ e.g.
    ▪ Time-sharing GPU
    ▪ Multi-instance GPU
https://cloud.google.com/kubernetes-engine/docs/how-to/autopilot-gpus
https://cloud.google.com/kubernetes-engine/docs/concepts/timesharing-gpus
https://cloud.google.com/kubernetes-engine/docs/how-to/gpus-multi
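On Autopilot the Pod spec itself carries the GPU choice. The sketch below, with a placeholder Pod name and image and `nvidia-l4` as an assumed GPU type, combines a `nodeSelector` on the `cloud.google.com/gke-accelerator` label with an `nvidia.com/gpu` limit.

```python
import json


def autopilot_gpu_pod(name: str, image: str, gpu_type: str, gpu_count: int = 1) -> dict:
    """Build a Pod manifest that lets GKE Autopilot provision a matching GPU node."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Selects the GPU type; Autopilot provisions a node that satisfies it.
            "nodeSelector": {"cloud.google.com/gke-accelerator": gpu_type},
            "containers": [
                {
                    "name": name,
                    "image": image,  # placeholder image, not from the talk
                    "resources": {"limits": {"nvidia.com/gpu": gpu_count}},
                }
            ],
        },
    }


# "nvidia-l4" is an assumed GPU type; use one that Autopilot offers in your region.
pod = autopilot_gpu_pod("autopilot-gpu", "registry.example/gpu-app:latest", "nvidia-l4")
print(json.dumps(pod, indent=2))
```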

24 Deploy LLM on GKE
• Overview
• Get Hugging Face token
• Create Cluster
• Create Secret
• Deploy model
• Get response

25 Deploy LLM on GKE: Overview
• Model: Gemma 2B
• Library: Text Generation Inference (TGI)
• Image: pytorch-hf-tgi-serve
  ◦ Prebuilt container from Google
  ◦ Acts as the inference server
  ◦ Uses PyTorch and TGI
  ◦ Set the following environment variables
    ▪ MODEL_ID: the model on Hugging Face
    ▪ HUGGING_FACE_HUB_TOKEN: the API token
https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-gemma-gpu-tgi
https://huggingface.co/google/gemma-2b
https://github.com/huggingface/text-generation-inference

26 Deploy LLM on GKE: Overview
• Pull the image from Artifact Registry
• Download the model from Hugging Face when the container starts
• Specify the model and the token via env
• Resources
  ◦ Service
  ◦ Deployment
  ◦ Secret
[Diagram: llm-cluster (Google Kubernetes Engine) pulls the pytorch-hf-tgi-serve image from Artifact Registry and downloads the Gemma 2B model from Hugging Face]
https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-gemma-gpu-tgi

27 Deploy LLM on GKE: Get Hugging Face token
• Get a Hugging Face token
https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-gemma-gpu-tgi

28 Deploy LLM on GKE: Create Cluster
• Create the cluster
• Specify Autopilot
https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-gemma-gpu-tgi

29 Deploy LLM on GKE: Create Secret
• Create the Secret
• Register the Hugging Face token in it
https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-gemma-gpu-tgi
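A Secret holding the token could look like the manifest this Python sketch generates. The Secret name `hf-secret`, the key `hf_api_token`, and the `HF_TOKEN` environment variable are assumptions made for the example; values under `data` in a Secret must be base64-encoded, which the helper takes care of.

```python
import base64
import json
import os


def hf_token_secret(token: str, name: str = "hf-secret", key: str = "hf_api_token") -> dict:
    """Build an Opaque Secret manifest; values under `data` must be base64-encoded."""
    encoded = base64.b64encode(token.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": {key: encoded},
    }


# HF_TOKEN is a hypothetical variable name; the fallback is a dummy placeholder.
token = os.environ.get("HF_TOKEN", "hf_placeholder_token")
secret = hf_token_secret(token)
print(json.dumps(secret, indent=2))
```

Base64 is an encoding, not encryption, so the generated manifest should be handled as carefully as the token itself.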

30 Deploy LLM on GKE: Deploy model
• Create the Deployment
  ◦ image
  ◦ env
• Create the Service
https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-gemma-gpu-tgi
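The Deployment and Service can be sketched as follows. `MODEL_ID` and `HUGGING_FACE_HUB_TOKEN` come from the slides; everything else is an assumption for illustration: the image path is a placeholder to be replaced with the real pytorch-hf-tgi-serve location, and the resource names, Secret name and key, port 8000, and GPU type are hypothetical.

```python
import json

# Placeholder: substitute the actual Artifact Registry path and tag of the image.
IMAGE = "REGION-docker.pkg.dev/PROJECT/REPOSITORY/pytorch-hf-tgi-serve:TAG"


def tgi_deployment(name: str, image: str, model_id: str, port: int = 8000) -> dict:
    """Build a Deployment that runs the inference server with one GPU."""
    container = {
        "name": "inference-server",
        "image": image,
        "ports": [{"containerPort": port}],  # assumed serving port
        "env": [
            {"name": "MODEL_ID", "value": model_id},
            {
                "name": "HUGGING_FACE_HUB_TOKEN",
                # Assumed Secret name and key; match them to your own Secret.
                "valueFrom": {"secretKeyRef": {"name": "hf-secret", "key": "hf_api_token"}},
            },
        ],
        "resources": {"limits": {"nvidia.com/gpu": 1}},
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "containers": [container],
                    # Assumed GPU type for Autopilot.
                    "nodeSelector": {"cloud.google.com/gke-accelerator": "nvidia-l4"},
                },
            },
        },
    }


def tgi_service(name: str, app: str, port: int = 8000) -> dict:
    """Build a ClusterIP Service in front of the Deployment's Pods."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "type": "ClusterIP",
            "selector": {"app": app},
            "ports": [{"protocol": "TCP", "port": port, "targetPort": port}],
        },
    }


deployment = tgi_deployment("tgi-gemma", IMAGE, "google/gemma-2b")
service = tgi_service("llm-service", "tgi-gemma")
bundle = {"apiVersion": "v1", "kind": "List", "items": [deployment, service]}
print(json.dumps(bundle, indent=2))
```

Referencing the token through `secretKeyRef` keeps it out of the Deployment manifest itself.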

31 Deploy LLM on GKE: Get response
• Run port-forward and get a response
https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-gemma-gpu-tgi
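Once the Service is forwarded to a local port, the server can be queried over HTTP. The sketch below targets TGI's `/generate` endpoint using only the Python standard library; the local port 8000, the prompt, and the `max_new_tokens` value are assumptions for the example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed local port used with kubectl port-forward


def build_generate_request(prompt: str, max_new_tokens: int = 128) -> urllib.request.Request:
    """Build a POST request for TGI's /generate endpoint."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        f"{BASE_URL}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def generate(prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the generated text from the JSON response."""
    with urllib.request.urlopen(build_generate_request(prompt), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["generated_text"]


# Usage, once port-forward is running in another terminal:
#   print(generate("Why is the sky blue?"))
```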

32 Summary

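33 Summary
• There are cases where hosting an LLM in-house is worthwhile
  ◦ 1. In-house LLM only
  ◦ 2. Combined use of multiple LLMs
• Kubernetes is an effective platform for hosting an LLM
  ◦ Deployment is easy
• Future work
  ◦ Efficient use of GPUs
    ▪ cf. Dynamic resource allocation
  ◦ Operating and maintaining the inference server
    ▪ Observability, SLI/SLO, cost management
  ◦ Know-how for developing an in-house LLM
    ▪ Fine-tuning, quantization, model evaluation, acquiring high-quality datasets
https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/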

34 Summary: Reference
• https://arxiv.org/abs/1706.03762
• https://speakerdeck.com/pfn/llmnoxian-zai
• https://huggingface.co/google/gemma-7b
• https://github.com/huggingface/text-generation-inference
• https://github.com/vllm-project/vllm
• https://slack.com/intl/ja-jp/blog/news/slack-ai-has-arrived
• https://speakerdeck.com/tkikuchi1002/llm-engineering-architecture?slide=33
• https://shigoraku.ai/
• https://www.kantei.go.jp/jp/singi/titeki2/ai_kentoukai/gijisidai/index.html
• https://speakerdeck.com/bells17/kep-3063-dynamic-resource-allocation
• https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/
• https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/
• https://intel.github.io/kubernetes-docs/device-plugins/index.html
• https://github.com/ROCm/k8s-device-plugin
• https://github.com/NVIDIA/k8s-device-plugin

35 Summary: Reference
• https://cloud.google.com/kubernetes-engine/docs/how-to/gpus
• https://cloud.google.com/kubernetes-engine/docs/how-to/autopilot-gpus
• https://cloud.google.com/kubernetes-engine/docs/concepts/timesharing-gpus
• https://cloud.google.com/kubernetes-engine/docs/how-to/gpus-multi
• https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-gemma-gpu-tgi
• https://huggingface.co/google/gemma-2b
• https://github.com/huggingface/text-generation-inference

36 We are hiring
