DeepSeek on AWS

Title: Deploying and Scaling LLMs on AWS: A Practical Guide to Infrastructure Choices

Abstract:
Foundation Models and LLMs now range from billions to trillions of parameters, presenting unique deployment challenges. Organizations must navigate tradeoffs between cost, operational efficiency, and implementation complexity.
This talk provides a practical guide to deploying LLMs on AWS and hardware accelerators, including GPUs and AI chips (AWS Trainium and Inferentia). Yoshitaka will offer step-by-step guidance for deployment using real-world examples like DeepSeek-R1 and its Distill variants, along with best practices for optimizing cost and performance at scale.

Bio:
Yoshitaka Haribara is a Sr. GenAI/Quantum Startup Solutions Architect at AWS and Visiting Associate Professor at Center for Quantum Information and Quantum Biology (QIQB), The University of Osaka. At AWS, he works with leading Japanese generative AI startups including Sakana AI, ELYZA, and Preferred Networks (PFN). He helps Japanese model providers develop Japanese LLMs and list them (PFN, Stockmark, and Karakuri LLMs) on Amazon Bedrock Marketplace. With his background in combinatorial optimization with quantum optical devices, Yoshitaka also guides customers in leveraging Amazon Braket for quantum applications and works to bring Japanese quantum hardware to AWS. He holds a Ph.D. in Mathematical Informatics from The University of Tokyo and a B.S. in Mathematics from The University of Osaka.

Event page:
https://lu.ma/2whjtkdi?tk=xtQ5qR

Yoshitaka Haribara

February 04, 2025

Transcript

  1. DeepSeek on AWS
    TAI AAI #07 - SCALING AI PERFORMANCE
    Yoshitaka Haribara, Ph.D.
    Sr. GenAI/Quantum Startup Solutions Architect, AWS
    © 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  2. Agenda
    • DeepSeek-R1 and Distilled Models Overview
    • Accelerators: NVIDIA H200 GPU and AWS Trainium
    • Deployment options on AWS: Bedrock, SageMaker AI, and EC2
    • Best practices
  3. DeepSeek offers a range of open-weights models and efficient distilled variants.
    Base and R1 Models (671B)
    • DeepSeek-V3: Base MoE model
    • DeepSeek-R1-Zero: Pure reinforcement learning
    • DeepSeek-R1: Cold-start data before RL
    Distilled Models
    • DeepSeek-R1-Distill-Qwen (1.5B, 7B, 14B, 32B)
    • DeepSeek-R1-Distill-Llama (8B and 70B)
    DeepSeek enables organizations to leverage advanced reasoning capabilities across multiple tasks.
  4. Core Capabilities
    • Advanced reasoning capabilities optimized for complex problem-solving (e.g. mathematics and coding tasks).
    • Reported to outperform comparable reasoning models on AIME 2024, MATH-500, and SWE-bench Verified.
    • Reportedly 90-95% more affordable than comparable models.
    • 671B-parameter Mixture of Experts (MoE) architecture, with 37B parameters activated per token.
    • DeepSeek-R1 requires at least 800 GB of HBM in FP8 format for inference.
  5. EC2 accelerated compute instances for AI/ML
    • GPUs (H100, H200, B200, GB200, A100, L40S, L4, A10G): G5 (A10G), G6 (L4), G6e (L40S), P4 (A100), P5 (H100), P5e (H200), P5en (H200); B200 and GB200 announced
    • AI/ML accelerators and ASICs: Inf1, Inf2, Trn1, Trn2 (AWS Trainium, Inferentia); DL1 (Gaudi accelerator); DL2q (Cloud AI100)
    • Also listed in the figure: Standard Radeon GPU, Xilinx accelerator, Xilinx FPGA
  6. Accelerated compute architecture
    [Diagram: host with CPUs, NSC, and EBS; a PCIe switching layer connecting EFA devices and SSDs; accelerators (ML chips) linked by an ML chip interconnect]
  7. P5 instances
    • Optimized for AI training and inference
    • 900 GB/s NVSwitch for GPU peer-to-peer connections
    • Scale-out with non-blocking interconnect, Elastic Fabric Adapter (EFA)
    Instance | GPU           | GPU memory | CPU       | vCPU | Instance memory | Networking      | Local storage
    P5       | 8 NVIDIA H100 | 640 GB     | AMD Milan | 192  | 2 TB            | 3200 Gbps EFAv2 | 30 TB SSD
    P5e      | 8 NVIDIA H200 | 1128 GB    | AMD Milan | 192  | 2 TB            | 3200 Gbps EFAv2 | 30 TB SSD
    P5en     | 8 NVIDIA H200 | 1128 GB    | Intel SPR | 192  | 2 TB            | 3200 Gbps EFAv3 | 30 TB SSD
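
A back-of-the-envelope check ties the "at least 800 GB of HBM in FP8" requirement from the Core Capabilities slide to the GPU memory column above. This is a rough sketch, not a sizing tool: it assumes 1 byte per parameter for FP8 weights and treats the gap between the weights and the quoted 800 GB as headroom for KV cache and activations, which is an assumption about how that figure is composed.

```python
# GPU memory per instance in GB, taken from the P5 slide.
GPU_MEMORY_GB = {"P5": 640, "P5e": 1128, "P5en": 1128}

PARAMS_BILLION = 671      # DeepSeek-R1 total parameters (from the slides)
BYTES_PER_PARAM_FP8 = 1   # assumption: FP8 stores one byte per parameter
REQUIRED_HBM_GB = 800     # minimum HBM quoted in the slides


def weight_memory_gb(params_billion, bytes_per_param):
    """Approximate weight footprint in GB: billions of params x bytes each."""
    return params_billion * bytes_per_param


def fits(instance, required_gb=REQUIRED_HBM_GB):
    """True if the instance's total GPU memory meets the requirement."""
    return GPU_MEMORY_GB[instance] >= required_gb


weights_gb = weight_memory_gb(PARAMS_BILLION, BYTES_PER_PARAM_FP8)
headroom_gb = REQUIRED_HBM_GB - weights_gb  # left for KV cache, activations

if __name__ == "__main__":
    print(f"weights: {weights_gb} GB, headroom within 800 GB: {headroom_gb} GB")
    for name in GPU_MEMORY_GB:
        print(name, "fits" if fits(name) else "does not fit")
```

Under these assumptions a single P5 (640 GB) falls short even for the weights alone, while P5e and P5en (1128 GB) clear the 800 GB bar.
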
  8. Bedrock Marketplace implementation
    • Bedrock Marketplace enables deployment of the core DeepSeek-R1 model to managed endpoints
    • Complete code samples and step-by-step deployment guides provided for quick implementation
    • Standard Bedrock security and monitoring features
  9. Bedrock Marketplace delivers 100+ models from 30+ providers
    [Provider logos: EVOLUTIONARY SCALE, WIDN, CAMB.AI, GRETEL, ARCEE AI, PREFERRED NETWORKS, WRITER, UPSTAGE, NCSOFT, STOCKMARK, KARAKURI, JOHN SNOW LABS, LIQUID, DATABRICKS, CYBERAGENT, HUGGING FACE, STABILITY AI, LG AI RESEARCH, MISTRAL AI, SNOWFLAKE, NVIDIA, DEEPSEEK]
  10. Prerequisite: Increase your ml.p5e.48xlarge limits before deployment
  11. Step 1: Find the DeepSeek-R1 model in the catalog
  12. Step 2: Set options (ml.p5e.48xlarge by default) and deploy
  13. Step 3: Playground or InvokeModel API
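
For the InvokeModel path in Step 3, a minimal sketch with boto3 might look like the following. The `invoke_model(modelId=..., body=...)` call is the standard `bedrock-runtime` client method; the request-body field names (`inputs`, `parameters`, `max_new_tokens`) and the default sampling values are assumptions about what a Marketplace-deployed endpoint accepts, and the endpoint ARN is a placeholder, so check the deployment guide for the schema your endpoint expects.

```python
import json


def build_request_body(prompt, max_new_tokens=512, temperature=0.6, top_p=0.9):
    """Serialize a text-generation request.

    Field names are an assumed schema for a Marketplace endpoint,
    not a documented contract.
    """
    return json.dumps({
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            "top_p": top_p,
        },
    })


def invoke(client, endpoint_arn, prompt):
    """Call InvokeModel and decode the JSON response body."""
    response = client.invoke_model(
        modelId=endpoint_arn,  # for Marketplace deployments, the endpoint ARN
        body=build_request_body(prompt),
    )
    return json.loads(response["body"].read())


# Usage (requires AWS credentials and a deployed endpoint):
#   import boto3
#   client = boto3.client("bedrock-runtime")
#   result = invoke(client, "<your-endpoint-arn>", formatted_prompt)
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before pointing it at a real (and billable) endpoint.
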
  14. Tips: Use proper chat template (model tokenizer)
    Example with DeepSeek-R1-Distill-Llama-8B (via Bedrock CMI)
    • When using Bedrock Playground, we must add the proper chat template tags for optimal results. E.g.:
        <|begin▁of▁sentence|><|User|>A man has 53 socks in his drawer: 21 identical blue, 15 identical black and 17 identical red. The lights are out, and he is completely in the dark. How many socks must he take out to make 100 percent certain he has at least one pair of black socks?<|Assistant|>
    • When using the InvokeModel API, we must configure the proper tokenizer to apply the chat template. E.g.:
        from transformers import AutoTokenizer
        # hf_model_id, test_prompt, and continuation are defined by the caller
        tokenizer = AutoTokenizer.from_pretrained(hf_model_id)
        messages = [{"role": "user", "content": test_prompt}]
        prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=not continuation)
    [Screenshots: Bad quality output / Good quality output]
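
The hand-written Playground prompt above can be reproduced with a small helper when a tokenizer is not at hand. This is only a sketch for a single user turn: the tag strings are copied from the slide, and the tokenizer's own chat template remains the authoritative source (the real special tokens may use visually similar but different Unicode characters), so prefer `apply_chat_template` whenever the tokenizer is available.

```python
# Tag strings copied from the slide; treat them as illustrative.
BOS = "<|begin▁of▁sentence|>"
USER = "<|User|>"
ASSISTANT = "<|Assistant|>"


def format_single_turn(user_message):
    """Wrap one user message so the model continues as the assistant."""
    return f"{BOS}{USER}{user_message}{ASSISTANT}"


if __name__ == "__main__":
    print(format_single_turn("How many socks must he take out?"))
```
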
  15. DeepSeek-R1 Responsible AI concerns
    Amazon Bedrock Guardrails (through the ApplyGuardrail API) can provide an extra layer of security and responsible AI measures
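
As an illustration of the ApplyGuardrail call, the sketch below screens a piece of text (a user prompt or a model response) before it is passed on. It uses the `apply_guardrail` method of the boto3 `bedrock-runtime` client; the guardrail ID and version are placeholders for a guardrail you have already created, and the helper name and return convention are hypothetical.

```python
def check_with_guardrail(client, guardrail_id, guardrail_version, text,
                         source="OUTPUT"):
    """Return (allowed, text_to_use) for one piece of text.

    source is "INPUT" for user prompts or "OUTPUT" for model responses.
    """
    response = client.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion=guardrail_version,
        source=source,
        content=[{"text": {"text": text}}],
    )
    if response.get("action") == "GUARDRAIL_INTERVENED":
        # Use the guardrail's replacement text (blocked message or masked text).
        outputs = response.get("outputs", [])
        replacement = outputs[0]["text"] if outputs else ""
        return False, replacement
    return True, text


# Usage (requires AWS credentials and an existing guardrail):
#   import boto3
#   client = boto3.client("bedrock-runtime")
#   ok, safe_text = check_with_guardrail(client, "<guardrail-id>", "1", answer)
```

Because the check is independent of the model call, the same helper can wrap responses from a Marketplace endpoint or an imported model.
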
  16. Enterprise Protection
    • Enterprise-grade security features built-in
    • Complete data privacy when using AWS services
    • No data sharing with model providers
    • End-to-end encryption for all operations
    • Access controls and governance features
    • Compliance with AWS security standards
  17. Critical Concerns
    • Models hosted by AWS without any communication with DeepSeek servers or APIs
    • No customer data used to improve base models
    • Enterprise data protection capabilities
    • Privacy control through AWS services
  18. Model Options
    • Distilled models maintain most core capabilities while reducing latency and cost
    • Optimized for different computational and performance requirements
    • DeepSeek-R1-Distill-Llama offered in 8B and 70B versions
    • DeepSeek-R1-Distill-Qwen available in 1.5B, 7B, 14B, 32B variants (SageMaker AI only)
  19. Custom Model Import implementation
    • Bedrock Custom Model Import enables DeepSeek deployment
    • Support for Llama 8B and 70B distilled DeepSeek-R1 variants
    • Complete code samples and step-by-step deployment guides provided for quick implementation
    • Standard Bedrock security and monitoring features
    • Pricing is on-demand, billed in 5-minute windows starting from the first successful invocation
    • Expect cold-start latency and scaling up/down time
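
Because an imported model may need time to load after a period of inactivity, callers usually want to retry rather than fail on the first request. The sketch below is one way to do that with exponential backoff. It assumes the cold-start condition surfaces as a botocore `ClientError` whose error code is `ModelNotReadyException`; verify the code against the Custom Model Import documentation. The helper names and default delays are hypothetical.

```python
import time


def is_model_not_ready(exc):
    """True if exc looks like a botocore ClientError for a model still loading.

    The error code string is an assumption; confirm it for your setup.
    """
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code") == "ModelNotReadyException"


def invoke_with_retry(call, max_attempts=5, base_delay_s=2.0, sleep=time.sleep):
    """Run call(), retrying with exponential backoff while the model loads."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception as exc:
            # Re-raise anything that is not a cold start, or the last failure.
            if not is_model_not_ready(exc) or attempt == max_attempts:
                raise
            sleep(base_delay_s * 2 ** (attempt - 1))


# Usage (client is a boto3 "bedrock-runtime" client, arn the imported model ARN):
#   result = invoke_with_retry(
#       lambda: client.invoke_model(modelId=arn, body=body))
```

The `sleep` parameter is injectable so the backoff schedule can be checked without actually waiting.
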
  20. Trn1/Trn2 instances
    • Powered by AWS Trainium/Trainium2 custom ML chips
    • Optimized for large-scale distributed training workloads
    • Trn2 UltraServers with extended NeuronLink for trillion-parameter AI
    • Neuron Kernel Interface (NKI) for custom operators
    Instance       | Accelerators | Accelerator memory | vCPU | Instance memory | Networking
    trn1.32xlarge  | 16           | 512 GB             | 128  | 512 GB          | 800 Gbps EFAv2
    trn1n.32xlarge | 16           | 512 GB             | 128  | 512 GB          | 1600 Gbps EFAv2
    trn2.48xlarge  | 16           | 1.5 TB             | 192  | 2 TB            | 3.2 Tbps EFAv3
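
The accelerator memory column can be turned into a rough lower bound on how many accelerators a distilled model's weights must be sharded across. This sketch assumes BF16 weights (2 bytes per parameter), reads 1.5 TB as 1536 GB, and counts weights only, with no KV cache or activations. Real tensor-parallel degrees are constrained by the model architecture and the Neuron SDK, so treat the result as a floor, not a recommendation.

```python
import math

# (accelerator count, total accelerator memory in GB) from the Trn slide.
# 1.5 TB is taken as 1536 GB, which is an assumption about the unit.
INSTANCES = {
    "trn1.32xlarge": (16, 512),
    "trn1n.32xlarge": (16, 512),
    "trn2.48xlarge": (16, 1536),
}


def memory_per_accelerator_gb(instance):
    """Total accelerator memory divided evenly across accelerators."""
    count, total_gb = INSTANCES[instance]
    return total_gb / count


def min_accelerators_for_weights(params_billion, instance, bytes_per_param=2):
    """Fewest accelerators whose combined memory holds the weights alone."""
    weights_gb = params_billion * bytes_per_param  # BF16 assumed by default
    return math.ceil(weights_gb / memory_per_accelerator_gb(instance))


if __name__ == "__main__":
    for size in (8, 70):  # DeepSeek-R1-Distill-Llama sizes from the slides
        for name in ("trn1.32xlarge", "trn2.48xlarge"):
            n = min_accelerators_for_weights(size, name)
            print(f"{size}B on {name}: at least {n} accelerator(s)")
```
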
  21. AWS Trainium architecture
    • Tensor engines are based on a power-optimized systolic array
    • AWS Neuron SDK supports typical architectures such as Llama
  22. Summary: DeepSeek-R1 deployment options on AWS
    1. Amazon Bedrock Marketplace (Amazon SageMaker JumpStart) for the DeepSeek-R1 model
    2. Amazon Bedrock Custom Model Import for the DeepSeek-R1-Distill models
    3. Amazon EC2 Trn1 instances for the DeepSeek-R1-Distill models
    DeepSeek on AWS blog: https://aws.amazon.com/blogs/aws/deepseek-r1-models-now-available-on-aws/
  23. Thank you!
    Yoshitaka Haribara, Ph.D.
    X: @_hariby
  24. Further reading
    • DeepSeek
      • Anthropic CEO Dario Amodei's blog
        https://darioamodei.com/on-deepseek-and-export-controls
    • Startup Customer Case Studies on AWS
      • Sakana AI
        https://aws.amazon.com/startups/learn/letting-nature-lead-how-sakana-ai-is-transforming-model-building?lang=en-US
      • ELYZA (Llama2 Speculative Decoding on AWS Inferentia2 chip)
        https://aws.amazon.com/jp/blogs/startup/tech-interview-elyza-2024/
      • LLM Development on Trn1
        https://aws.amazon.com/jp/blogs/machine-learning/unlocking-japanese-llms-with-aws-trainium-innovators-showcase-from-the-aws-llm-development-support-program/