
Pycon Thailand 2025 - ML Model Serving Optimization with ONNX


Karn Wong

October 20, 2025

Transcript

  1. Karn Wong
     - Independent Consultant
     - Loves optimization
     - Has too much fun cranking out benchmarks
     - HashiCorp Ambassador & AWS Community Builder
     - Blog & portfolio: karnwong.me
     - Say hi on Bluesky: @karnwong.me
  2. What Is Machine Learning Used For
     - Price prediction
     - Route duration prediction
     - Recommendations
     - Content monitoring
  3. Route Duration Prediction Revisited
     - Input: user attributes, time of day, current traffic, holidays, route
     - Output: route duration
  4. What You Can Do
     - Split the model into different segments (see the dispatch sketch below)
     - Increased code complexity
     - Watch out for data drift
     - But it’s a lot of work 😖
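     For illustration, a hypothetical sketch of the per-segment approach: one small model per segment, dispatched at inference time. The segment names and toy data here are invented, not from the talk.

         import numpy as np
         from sklearn.linear_model import LinearRegression

         rng = np.random.default_rng(0)

         # Hypothetical segments; train one model per segment's data.
         segment_models = {
             s: LinearRegression().fit(rng.random((50, 4)), rng.random(50))
             for s in ["weekday", "weekend"]
         }

         def predict_duration(segment: str, features: np.ndarray) -> float:
             # Route the request to the model trained on that segment.
             return float(segment_models[segment].predict(features)[0])

         print(predict_duration("weekday", rng.random((1, 4))))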
  5. What You Can Do (cont.)
     - Reduce training data size (see the sampling sketch below)
     - Accuracy tradeoff
     - Have to make sure the data distribution stays the same
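     One way to shrink training data while preserving its distribution is stratified sampling. A minimal sketch with toy data, assuming a classification target (for a regression target such as route duration, you would bin the target first):

         import numpy as np
         from sklearn.model_selection import train_test_split

         rng = np.random.default_rng(0)
         X = rng.random((1000, 4))
         y = rng.integers(0, 3, size=1000)  # toy class labels

         # Keep 30% of the rows; stratify=y preserves class proportions.
         X_small, _, y_small, _ = train_test_split(
             X, y, train_size=0.3, stratify=y, random_state=42
         )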
  6. What About Model Size?
     - Shouldn’t be an issue in 2025: LLMs are waaaaaay larger than most models
     - Back then it was an issue:
     - Ops didn’t understand ML infra
     - Path of least resistance: Function as a Service (FaaS)
     - FaaS (then) had limited storage capacity (around 500 MB)
  7. Enter ONNX
     - Lighter footprint than raw models created via scikit-learn, PyTorch, etc. (see the conversion sketch below)
     - Ecosystem-agnostic: can be served outside of Python
     - Faster inference time
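     A minimal sketch of serializing a scikit-learn model to ONNX, assuming the skl2onnx package; the model and data here are toy stand-ins:

         import numpy as np
         from sklearn.ensemble import RandomForestRegressor
         from skl2onnx import to_onnx

         # Toy stand-in for route-duration features and targets.
         X = np.random.rand(100, 4).astype(np.float32)
         y = np.random.rand(100).astype(np.float32)
         model = RandomForestRegressor(n_estimators=10).fit(X, y)

         # skl2onnx infers the ONNX input signature from a sample batch.
         onx = to_onnx(model, X[:1])
         with open("model.onnx", "wb") as f:
             f.write(onx.SerializeToString())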
  8. What is ONNX?
     - Open Neural Network Exchange
     - Created by the PyTorch team at Facebook in 2017
     - Accepted as a graduate project in Linux Foundation AI in 2019
     - Interoperability between frameworks*
     *https://onnx.ai/supported-tools.html
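     Serving the serialized model then needs only the onnxruntime package, not scikit-learn. A sketch assuming a model.onnx produced as in the conversion example above:

         import numpy as np
         import onnxruntime as ort

         sess = ort.InferenceSession("model.onnx",
                                     providers=["CPUExecutionProvider"])
         input_name = sess.get_inputs()[0].name

         # Single-row inference against the serialized model.
         x = np.random.rand(1, 4).astype(np.float32)
         prediction = sess.run(None, {input_name: x})[0]
         print(prediction)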
  9. Benchmark Results (cont.)
     Python API: scikit-learn vs ONNX
     - scikit-learn median runtime: 0.0145 s
     - ONNX median runtime: 0.0131 s
     - Performance boost: 9.93%
     Python’s scikit-learn vs Rust’s ONNX (slowest vs fastest runtime combination)
     - Python scikit-learn median runtime: 0.0145 s
     - Rust ONNX median runtime: 0.0120 s
     - Performance boost: 17.18%
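     A hedged sketch of how such medians can be measured in Python; this is not the talk’s actual harness, and the numbers will vary by machine:

         import timeit
         import numpy as np
         import onnxruntime as ort
         from sklearn.ensemble import RandomForestRegressor
         from skl2onnx import to_onnx

         X = np.random.rand(100, 4).astype(np.float32)
         y = np.random.rand(100).astype(np.float32)
         model = RandomForestRegressor(n_estimators=10).fit(X, y)
         sess = ort.InferenceSession(to_onnx(model, X[:1]).SerializeToString(),
                                     providers=["CPUExecutionProvider"])
         name = sess.get_inputs()[0].name

         # Median single-row latency over 1000 runs for each backend.
         x = X[:1]
         sk = np.median(timeit.repeat(lambda: model.predict(x),
                                      number=1, repeat=1000))
         ox = np.median(timeit.repeat(lambda: sess.run(None, {name: x}),
                                      number=1, repeat=1000))
         print(f"scikit-learn median: {sk:.4f}s | ONNX median: {ox:.4f}s")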
  10. What About Model Size? (LLM edition 👀)
      - LLM ➡️ ONNX 🟰 SLM (Small Language Model)?
      - SLM should be able to run on edge and embedded devices
      - Typically less than 10 billion parameters
      - ONNX serialization does not reduce parameter size
      - Quantization can reduce parameter size (see the sketch below)
      - GGUF is better for LLMs
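     Quantization is a one-liner with onnxruntime’s tooling. A minimal sketch of dynamic int8 quantization; the file names are illustrative:

         from onnxruntime.quantization import quantize_dynamic, QuantType

         # Rewrites float32 weights as int8, shrinking storage roughly 4x.
         quantize_dynamic(
             model_input="model.onnx",
             model_output="model.quant.onnx",
             weight_type=QuantType.QInt8,
         )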
  11. Conclusion
      - Models created via Python-based frameworks can be serialized into ONNX
      - Results in faster inference time: a significant speed boost from scikit-learn/PyTorch/etc. to ONNX
      - Using a compiled language (e.g. Rust) can speed it up further