
Pycon Thailand 2025 - ML Model Serving Optimization with ONNX


Karn Wong

October 20, 2025

Transcript

  1. Karn Wong
     - Independent Consultant
     - Loves optimization
     - Has too much fun cranking out benchmarks
     - HashiCorp Ambassador & AWS Community Builder
     - Blog & portfolio: karnwong.me
     - Say hi on Bluesky: @karnwong.me
  2. What Is Machine Learning Used For
     - Price prediction
     - Route duration prediction
     - Recommendations
     - Content monitoring
  3. Route Duration Prediction Revisited
     - Input: user attributes, time of day, current traffic, holidays, route
     - Output: route duration
  4. What You Can Do
     - Split the model into different segments (see the dispatch sketch below)
     - Increased code complexity
     - Watch out for data drift
     - But it’s a lot of work 😖
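     For illustration, a hypothetical sketch of the per-segment approach: one small model per segment, dispatched at inference time. The segment names and toy data here are invented, not from the talk.

         import numpy as np
         from sklearn.linear_model import LinearRegression

         rng = np.random.default_rng(0)

         # Hypothetical segments; train one model per segment's data.
         segment_models = {
             s: LinearRegression().fit(rng.random((50, 4)), rng.random(50))
             for s in ["weekday", "weekend"]
         }

         def predict_duration(segment: str, features: np.ndarray) -> float:
             # Route the request to the model trained on that segment.
             return float(segment_models[segment].predict(features)[0])

         print(predict_duration("weekday", rng.random((1, 4))))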
  5. What You Can Do (cont.)
     - Reduce training data size (see the sampling sketch below)
     - Accuracy tradeoff
     - Have to make sure the data distribution stays the same
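     One way to shrink training data while preserving its distribution is stratified sampling. A minimal sketch with toy data, assuming a classification target (for a regression target such as route duration, you would bin the target first):

         import numpy as np
         from sklearn.model_selection import train_test_split

         rng = np.random.default_rng(0)
         X = rng.random((1000, 4))
         y = rng.integers(0, 3, size=1000)  # toy class labels

         # Keep 30% of the rows; stratify=y preserves class proportions.
         X_small, _, y_small, _ = train_test_split(
             X, y, train_size=0.3, stratify=y, random_state=42
         )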
  6. What About Model Size?
     - Shouldn’t be an issue in 2025: LLMs are waaaaaay larger than most models
     - Back then it was an issue:
     - Ops didn’t understand ML infra
     - Path of least resistance: Function as a Service (FaaS)
     - FaaS (then) had limited storage capacity (around 500 MB)
  7. Enter ONNX
     - Lighter footprint than raw models created via scikit-learn, PyTorch, etc. (see the conversion sketch below)
     - Ecosystem-agnostic: can be served outside of Python
     - Faster inference time
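     A minimal sketch of serializing a scikit-learn model to ONNX, assuming the skl2onnx package; the model and data here are toy stand-ins:

         import numpy as np
         from sklearn.ensemble import RandomForestRegressor
         from skl2onnx import to_onnx

         # Toy stand-in for route-duration features and targets.
         X = np.random.rand(100, 4).astype(np.float32)
         y = np.random.rand(100).astype(np.float32)
         model = RandomForestRegressor(n_estimators=10).fit(X, y)

         # skl2onnx infers the ONNX input signature from a sample batch.
         onx = to_onnx(model, X[:1])
         with open("model.onnx", "wb") as f:
             f.write(onx.SerializeToString())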
  8. What is ONNX?
     - Open Neural Network Exchange
     - Created by the PyTorch team at Facebook in 2017
     - Accepted as a graduate project in Linux Foundation AI in 2019
     - Interoperability between frameworks*
     *https://onnx.ai/supported-tools.html
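     Serving the serialized model then needs only the onnxruntime package, not scikit-learn. A sketch assuming a model.onnx produced as in the conversion example above:

         import numpy as np
         import onnxruntime as ort

         sess = ort.InferenceSession("model.onnx",
                                     providers=["CPUExecutionProvider"])
         input_name = sess.get_inputs()[0].name

         # Single-row inference against the serialized model.
         x = np.random.rand(1, 4).astype(np.float32)
         prediction = sess.run(None, {input_name: x})[0]
         print(prediction)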
  9. Benchmark Results (cont.)
     Python API: scikit-learn vs ONNX
     - scikit-learn median runtime: 0.0145 s
     - ONNX median runtime: 0.0131 s
     - Performance boost: 9.93%
     Python’s scikit-learn vs Rust’s ONNX (slowest vs fastest runtime combination)
     - Python scikit-learn median runtime: 0.0145 s
     - Rust ONNX median runtime: 0.0120 s
     - Performance boost: 17.18%
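     A hedged sketch of how such medians can be measured in Python; this is not the talk’s actual harness, and the numbers will vary by machine:

         import timeit
         import numpy as np
         import onnxruntime as ort
         from sklearn.ensemble import RandomForestRegressor
         from skl2onnx import to_onnx

         X = np.random.rand(100, 4).astype(np.float32)
         y = np.random.rand(100).astype(np.float32)
         model = RandomForestRegressor(n_estimators=10).fit(X, y)
         sess = ort.InferenceSession(to_onnx(model, X[:1]).SerializeToString(),
                                     providers=["CPUExecutionProvider"])
         name = sess.get_inputs()[0].name

         # Median single-row latency over 1000 runs for each backend.
         x = X[:1]
         sk = np.median(timeit.repeat(lambda: model.predict(x),
                                      number=1, repeat=1000))
         ox = np.median(timeit.repeat(lambda: sess.run(None, {name: x}),
                                      number=1, repeat=1000))
         print(f"scikit-learn median: {sk:.4f}s | ONNX median: {ox:.4f}s")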
  10. What About Model Size? (LLM edition 👀)
      - LLM ➡️ ONNX 🟰 SLM (Small Language Model)?
      - SLM should be able to run on edge and embedded devices
      - Typically less than 10 billion parameters
      - ONNX serialization does not reduce parameter size
      - Quantization can reduce parameter size (see the sketch below)
      - GGUF is better for LLMs
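     Quantization is a one-liner with onnxruntime’s tooling. A minimal sketch of dynamic int8 quantization; the file names are illustrative:

         from onnxruntime.quantization import quantize_dynamic, QuantType

         # Rewrites float32 weights as int8, shrinking storage roughly 4x.
         quantize_dynamic(
             model_input="model.onnx",
             model_output="model.quant.onnx",
             weight_type=QuantType.QInt8,
         )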
  11. Conclusion
      - Models created via Python-based frameworks can be serialized into ONNX
      - Results in faster inference time: a significant speed boost from scikit-learn/PyTorch/etc. to ONNX
      - Using a compiled language (e.g. Rust) can speed it up further