Understanding Thread Tuning for Inference Servers of Deep Models

An inference server is essential for serving predictions from machine learning models in real time. When inference runs on CPUs, the thread count settings have a large impact on system performance; it is not unusual for the right settings to improve throughput several times over.

This session explains how to tune the thread counts of an inference server, starting from the underlying principles. Using Triton Inference Server as an example, it shows how an inference server assigns CPU threads and how this leads to a tradeoff between latency and throughput. The goal of the talk is for attendees to understand these mechanisms so they can tune thread counts with a clear view of what to expect.

Transcript

  1. About Me (ID: zhanpon)
     • Working on an internal AI platform
     • Loves performance engineering
     • Hosting Python Meetup Fukuoka
  2. Latency and Throughput
     • Latency: the time from when a request reaches the server until the response begins to be sent (measured in milliseconds)
     • Throughput: how many requests are processed per unit time (measured in requests per second, RPS)
  3. Part I: Single-threaded Server
     Goal: understand the relationship between latency, throughput, and concurrency
     Agenda:
     1. Theory of the single-threaded inference server
     2. Experiments with the single-threaded inference server
  4. Basic Model: Single-threaded Server, Single Client
     Assumptions: • Each task takes 100 ms • 1 client
     Consequences: • Latency is 100 ms • Throughput is 10 rps
  5. The Law of Our System
     • Throughput: 10 rps
     • Latency: concurrency (the number of clients) * 100 ms
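
A quick way to see this law in action is a small back-of-the-envelope script. This is a sketch, not part of the talk; it assumes the closed-loop setup above (each client waits for its response before sending the next request) and a constant 100 ms compute time per request.

def predict_single_threaded(concurrency, compute_time_s=0.100):
    # A single-threaded server processes one request at a time, so a request
    # waits behind the other in-flight requests before being computed.
    latency_s = concurrency * compute_time_s      # queue time + compute time
    throughput_rps = concurrency / latency_s      # Little's law: concurrency = throughput * latency
    return latency_s, throughput_rps

for c in range(1, 9):
    latency_s, throughput_rps = predict_single_threaded(c)
    print(f"concurrency={c}: latency={latency_s * 1000:.0f} ms, throughput={throughput_rps:.1f} rps")

Throughput stays at 10 rps for every concurrency level while latency grows linearly, which is exactly what the benchmark on the next slides measures.
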
  6. Experiment It! Benchmark setup
     • Load generator: Triton Performance Analyzer
     • Server: Triton Inference Server serving ResNet-50
     • Input: random tensors of shape (1, 3, 224, 224)
     • Command: perf_analyzer -u ${TARGET_HOST} -m resnet50 --concurrency-range 1:8
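
To complement perf_analyzer, a single request can also be timed by hand with the Triton Python client. The sketch below is not from the talk: the HTTP port and the input tensor name ("input") are assumptions that depend on how the ResNet-50 model was exported, so adjust them to match the model's config.

import time
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")  # assumed default HTTP port

# One random input matching the benchmark shape (1, 3, 224, 224).
infer_input = httpclient.InferInput("input", [1, 3, 224, 224], "FP32")  # tensor name is model-specific
infer_input.set_data_from_numpy(np.random.rand(1, 3, 224, 224).astype(np.float32))

start = time.perf_counter()
client.infer(model_name="resnet50", inputs=[infer_input])
print(f"latency: {(time.perf_counter() - start) * 1000:.1f} ms")
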
  7. Benchmark Results: Compute Time and Queue Time
     • Compute time is constant
     • Queue time is proportional to concurrency
  8. Summary of Part I
     • Latency = compute time + queue time
     • Little’s law: latency, throughput, and concurrency are related
  9. Part II: Multi-threaded Server
     Goal: understand the latency-throughput tradeoff
     Agenda:
     1. Theory of the multi-threaded inference server
     2. Experiments with the multi-threaded inference server
  10. Horizontal vs Vertical Scaling
      • Horizontal: serve 2 clients in parallel, each task takes 100 ms
      • Vertical: serve clients one by one, each task takes 60 ms
  11. Latency-Throughput Tradeoff
      • Vertical scaling provides minimum latency (if not busy)
      • Horizontal scaling provides maximum throughput (if busy)
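
These two statements can be made concrete with the numbers from slide 10 (2 instances at 100 ms per task vs. 1 instance at 60 ms per task). The following is a toy closed-loop model, not a measurement; it assumes constant compute time and perfect load balancing across instances.

import math

def closed_loop(instances, compute_time_s, concurrency):
    # Each of the `concurrency` clients waits for its response before sending
    # the next request, and requests are processed in waves of `instances`.
    rounds = math.ceil(concurrency / instances)
    latency_ms = rounds * compute_time_s * 1000
    throughput_rps = concurrency / (rounds * compute_time_s)   # Little's law again
    return latency_ms, throughput_rps

for concurrency in (1, 2, 8):
    horizontal = closed_loop(instances=2, compute_time_s=0.100, concurrency=concurrency)
    vertical = closed_loop(instances=1, compute_time_s=0.060, concurrency=concurrency)
    print(f"concurrency={concurrency}: horizontal={horizontal}, vertical={vertical}")

With one idle client the vertical setup answers in 60 ms instead of 100 ms; with eight busy clients the horizontal setup sustains 20 rps against roughly 16.7 rps, which is the tradeoff stated on this slide.
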
  12. Experiment It! Configurations for Triton Inference Server
      # config.pbtxt
      platform: "onnxruntime_onnx"
      # Number of model instances (for horizontal scaling)
      instance_group [{ count: 1 kind: KIND_CPU }]
      # Intra-op parallelism (for vertical scaling)
      parameters { key: "intra_op_thread_count" value: { string_value: "1" } }
  13. Summary of Part II
      • There are different ways of scaling an inference server: horizontal and vertical
      • Horizontal scaling provides maximum throughput
      • Vertical scaling provides minimum latency
  14. Case Studies
      • Case 1: Throughput matters
      • Case 2: Watch out for context switches
      • Case 3: Avoid CPU throttling in container environments
  15. Throughput Matters
      • One day, I saw a team deploying inference servers with intra-op parallelism of 23
  16. Throughput Matters
      • A 20% increase in throughput → 17% fewer servers (1 / 1.2 ≈ 0.83 of the original fleet handles the same load)
      • These servers process 200k requests per second
  17. Watch out for Context Switches
      "Hi, the latency increases when two servers are deployed to the same node."
      "Ok, I'll take a look. Umm......, system CPU usage is 14%, that's a lot. Can you decrease TF_NUM_INTRA_THREADS?"
      "Wow, the latency drops. Thanks!"
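
The "system CPU usage" signal from this exchange can be watched with a few lines of monitoring code. This is a sketch and assumes the psutil package, which the talk does not mention; a high system share together with a high context-switch rate is a hint that too many inference threads are competing for the cores.

import psutil  # assumption: psutil is installed on the node

prev_ctx = psutil.cpu_stats().ctx_switches
while True:
    cpu = psutil.cpu_times_percent(interval=1.0)   # blocks for about 1 s and averages
    ctx = psutil.cpu_stats().ctx_switches
    print(f"user={cpu.user:.1f}%  system={cpu.system:.1f}%  ctx_switches/s={ctx - prev_ctx}")
    prev_ctx = ctx
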
  18. Avoid CPU Throttling
      "Hi, our inference servers are very slow. We use a gradient boosting model. Latency is 300 ms! That’s incredibly slow."
      "OK, I figured it out. Try OMP_NUM_THREADS=1."
      "Wow, the latency is 10 ms now!"
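
OMP_NUM_THREADS is normally set in the container spec, but the same fix can be sketched in the serving code, provided the variable is set before the OpenMP-based library is imported. The xgboost import below is only a placeholder; the talk does not say which gradient boosting library was used.

import os

# OpenMP reads OMP_NUM_THREADS when its runtime starts, so set it before
# importing the library that loads the gradient boosting model.
os.environ.setdefault("OMP_NUM_THREADS", "1")

import xgboost  # placeholder for whichever OpenMP-based library serves the model
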
  19. What is CPU Throttling?
      • Our inference server platform runs on Kubernetes
      • The container runtime throttles (temporarily pauses) processes that exceed the CPU limit (if configured)
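
Whether the limit is actually being hit can be read from the container's cgroup statistics. This sketch assumes cgroup v2; on cgroup v1 the file is /sys/fs/cgroup/cpu/cpu.stat and the last field is called throttled_time.

from pathlib import Path

# nr_throttled counts enforcement periods in which the container was paused;
# throttled_usec is the total time spent waiting because of the CPU limit.
for line in Path("/sys/fs/cgroup/cpu.stat").read_text().splitlines():
    key, value = line.split()
    if key in ("nr_periods", "nr_throttled", "throttled_usec"):
        print(key, value)
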
  20. Why CPU Throttling?
      Inference server: “Yay! I’m running on a 32-core machine. I’ll spin up 32 threads and do tons of inferences.”
      Kubernetes: “No, you are not allowed to get that much CPU time.”
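
The mismatch in this dialogue is visible from inside a container: the core count the process sees comes from the node, not from the CPU limit. This sketch assumes cgroup v2, where cpu.max holds "<quota> <period>" in microseconds ("max" means unlimited).

import os
from pathlib import Path

print("cores visible to the process:", os.cpu_count())   # the node's cores, e.g. 32

quota, period = Path("/sys/fs/cgroup/cpu.max").read_text().split()
if quota != "max":
    # The effective CPU limit enforced by Kubernetes, e.g. 4.0 for "limits.cpu: 4".
    print("effective CPU limit:", int(quota) / int(period))
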
  21. Takeaways
      • DO NOT rely on the default thread config
      • Start with intra-op parallelism 1
      • Watch out for key metrics such as queue time, system CPU usage, and CPU throttling