Technologista 2024 - Rust for Data - What Works and What Doesn't

Karn Wong

October 25, 2024

Transcript

  1. About me

    - kahnwong
    - Karn Wong
    - karnwong.me
    - Platform Engineer @ Data Cafe Company Limited
    - Used to work as a Data Engineer and Machine Learning Engineer
    - Tinkers with Go, Python and Rust
    - Deployed a bajillion things to production
  2. Table of contents

    1. Types of Data Work
    2. Data Engineering Workloads
    3. Machine Learning Workloads
    4. When to Use Rust
    5. When Not to Use Rust
    6. Bonus: Boosting Python Performance via Rust FFI
    7. Conclusion
       1. Should You Use Rust?
  3. Types of Data Work

    | Type             | Java / Scala | R  | Python | Rust |
    |------------------|--------------|----|--------|------|
    | Data Analysis    | ❌           | ✅ | ✅     | ❌   |
    | Data Engineering | ✅           | ❌ | ✅     | ✅   |
    | Machine Learning | ☑️           | ✅ | ✅     | ✅   |

    - Ratings are based on how common it is to adopt a language for a particular domain
    - Our main focus today is the Python vs Rust comparison, that is, Rust for Data Engineering and Machine Learning
  4. Data Engineering: Python

    - For small data, pandas is very popular
      - But it does not have strong typing features
      - You can't be sure what your dtypes are ⚠️⚠️⚠️
    - For big data, pyspark is king
      - If you are not familiar with the Spark DataFrame API, you can use Spark SQL
      - A pandas API is also available
    - The middle-of-the-road solution is polars, a Rust-based library similar to pandas but more performant
      - Also has strong typing features, courtesy of Rust
  5. Data Engineering: Rust

    Polars (pandas alternative)
    - 30k stars
    - Rust API documentation is still lacking compared to Python's
    - Verdict: you should use polars via Python. Better integration with the existing ecosystem that way

    DataFusion (Spark alternative)
    - 6.1k stars
    - Benchmark: https://andygrove.io/2018/03/datafusion-0.2.1-benchmark/
    - Still not widely adopted
    - Verdict: probably don't use it in production for now, unless you know what you are doing
  6. Machine Learning: Python

    Machine Learning Workflow
    - Data Preparation
    - Model Training
    - Model Evaluation
    - Model Deployment

    Model Training
    - Various frameworks to choose from: Spark MLlib, scikit-learn, pytorch, tensorflow
    - Large ecosystem to support distributed training and model deployment

    Model Deployment
    - Mostly done by using FastAPI to expose your model as an API endpoint
    - Real-time inference involves input validation and data prep ⚠️⚠️⚠️
  7. Machine Learning: Rust

    Model Training
    - Various frameworks for model training: candle-core, burn, tensorflow, torch-rs, linfa
    - Sparse documentation
    - Verdict: probably better to still use Python for model training due to its better ecosystem

    Model Deployment
    - Can convert your model to onnx or gguf and serve it via Rust
    - Rust also has backend frameworks similar to FastAPI, such as actix and axum
    - Strong type safety means the Rust compiler catches type errors at compile time, including in the data processing that runs before inference
      - In Python, sometimes you need to run the code to see the errors
    - Rust is compiled and manages memory without a garbage collector, so it is usually faster than Python for the same workload
    - Verdict: use Rust-based solutions if you want better performance and stability
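
The point about type safety in pre-inference data processing can be illustrated with a minimal, standard-library-only sketch. Everything here is an assumption made for illustration, not code from the talk: the `RawInput` and `Features` shapes, the `validate` helper, and the `predict` function, which is a stand-in for a real model call. The idea is that `predict` only accepts a `Features` value, so unvalidated input cannot reach the model without the compiler complaining.

```rust
/// Raw request payload as it might arrive from a client (hypothetical shape).
struct RawInput<'a> {
    age: &'a str,
    income: &'a str,
}

/// Validated model features: once this exists, every field has a known numeric type.
#[derive(Debug, PartialEq)]
struct Features {
    age: u8,
    income: f64,
}

/// Reasons a request can be rejected before inference.
#[derive(Debug, PartialEq)]
enum ValidationError {
    BadAge(String),
    BadIncome(String),
}

/// Parse and range-check the raw strings; the only way to obtain `Features`.
fn validate(raw: &RawInput<'_>) -> Result<Features, ValidationError> {
    let age = raw
        .age
        .trim()
        .parse::<u8>()
        .map_err(|_| ValidationError::BadAge(raw.age.to_string()))?;
    let income = raw
        .income
        .trim()
        .parse::<f64>()
        .map_err(|_| ValidationError::BadIncome(raw.income.to_string()))?;
    // `parse::<f64>` accepts "NaN" and "inf", so reject those explicitly.
    if !income.is_finite() || income < 0.0 {
        return Err(ValidationError::BadIncome(raw.income.to_string()));
    }
    Ok(Features { age, income })
}

/// Stand-in for a real model call (a made-up linear score, not a trained model).
fn predict(features: &Features) -> f64 {
    0.01 * f64::from(features.age) + 0.00001 * features.income
}

fn main() {
    let requests = [
        RawInput { age: "42", income: "55000" },
        RawInput { age: "forty-two", income: "55000" },
    ];
    for raw in &requests {
        // The compiler forces both the success and the failure case to be handled.
        match validate(raw) {
            Ok(features) => println!("score = {:.3}", predict(&features)),
            Err(err) => println!("rejected: {:?}", err),
        }
    }
}
```

Passing a `RawInput` straight to `predict` does not compile, which is the kind of mistake that in Python would typically only surface when the endpoint is actually called.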
  8. When to Use Rust

    - Rust is suitable when you have to create APIs for other services to use (model deployment)
    - Many production issues in Python come from unforeseen bugs hidden in nested statements
      - Rust would catch these errors at compile time
    - Additionally, if your services need low latency, Rust is a good fit because it is very fast
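
As a rough illustration of how the compiler surfaces bugs that would otherwise hide in a nested branch, here is a small sketch using only the standard library. The `Prediction` enum and the `to_response` function are hypothetical names invented for this example; a real service would use a framework such as actix or axum for the HTTP layer.

```rust
/// Hypothetical outcome of a model call inside an API handler.
#[derive(Debug, PartialEq)]
enum Prediction {
    Label(String),
    LowConfidence { label: String, score: f32 },
    ModelUnavailable,
}

/// Map each outcome to an HTTP-like status code and response body.
fn to_response(prediction: &Prediction) -> (u16, String) {
    match prediction {
        Prediction::Label(label) => (200, format!("label={}", label)),
        Prediction::LowConfidence { label, score } => {
            // Nested branch: both paths must return the same type.
            if *score < 0.2 {
                (422, "prediction too uncertain".to_string())
            } else {
                (200, format!("label={} (low confidence)", label))
            }
        }
        // Deleting this arm is a compile error (E0004, non-exhaustive patterns),
        // so a forgotten case cannot reach production.
        Prediction::ModelUnavailable => (503, "model unavailable".to_string()),
    }
}

fn main() {
    let outcomes = [
        Prediction::Label("cat".to_string()),
        Prediction::LowConfidence { label: "dog".to_string(), score: 0.1 },
        Prediction::ModelUnavailable,
    ];
    for outcome in &outcomes {
        let (status, body) = to_response(outcome);
        println!("{} {}", status, body);
    }
}
```

If a new variant is later added to `Prediction`, every `match` that does not handle it stops compiling, whereas an equivalent chain of `if`/`elif` statements in Python would silently fall through until that path is exercised.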
  9. When Not to Use Rust

    - Data processing involves a lot of transformations, so data types change all the time
      - Using Rust would make things more complicated and verbose
    - For model training, there are not many resources for doing this in Rust
      - Some light transformations are still involved, so Rust would be counterproductive in this case
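
To make the verbosity concern concrete, here is a small sketch of a two-step transformation written with only the standard library (no polars). The record types and helper functions are hypothetical and exist only for this example. Note that each step that changes the shape of the data needs its own struct, where a dataframe library in Python would usually need one or two lines.

```rust
use std::collections::BTreeMap;

/// Step 0: rows as read from a file, everything is still text.
struct RawSale {
    region: String,
    amount: String,
}

/// Step 1: the amount has been parsed, so the type of the column changed.
struct CleanSale {
    region: String,
    amount: f64,
}

/// Step 2: aggregated result, yet another shape.
#[derive(Debug, PartialEq)]
struct RegionTotal {
    region: String,
    total: f64,
}

/// Parse amounts and drop rows whose amount is not a number.
fn clean(rows: Vec<RawSale>) -> Vec<CleanSale> {
    rows.into_iter()
        .filter_map(|row| {
            let amount = row.amount.trim().parse::<f64>().ok()?;
            Some(CleanSale { region: row.region, amount })
        })
        .collect()
}

/// Sum amounts per region; BTreeMap keeps the output sorted by region name.
fn total_by_region(rows: &[CleanSale]) -> Vec<RegionTotal> {
    let mut totals: BTreeMap<&str, f64> = BTreeMap::new();
    for row in rows {
        *totals.entry(row.region.as_str()).or_insert(0.0) += row.amount;
    }
    totals
        .into_iter()
        .map(|(region, total)| RegionTotal { region: region.to_string(), total })
        .collect()
}

fn main() {
    let raw = vec![
        RawSale { region: "north".to_string(), amount: "10.5".to_string() },
        RawSale { region: "south".to_string(), amount: "n/a".to_string() },
        RawSale { region: "north".to_string(), amount: "4.5".to_string() },
    ];
    let cleaned = clean(raw);
    for row in total_by_region(&cleaned) {
        println!("{}: {}", row.region, row.total);
    }
}
```

The extra types do buy safety, since a text amount can never be summed by accident, but for exploratory or fast-changing pipelines that bookkeeping is the overhead the slide refers to.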
  10. Conclusion

    - Model deployment with Rust results in better reliability and fewer bugs in production, because you find out about the bugs at compile time
    - You can use Rust for data processing, model training and model deployment
    - Data processing in Rust means more overhead due to its strong typing features, and Rust has a less mature ecosystem than Python
    - Model training in Rust has the same problems as data processing, namely a lack of ecosystem and readily available examples
  11. Should You Use Rust?

    - It's harder to find Rust developers :(
      - Most data folks are only familiar with Python
    - Rust takes longer to write (but it will pay off in production due to fewer bugs)
    - If you manage to find Rust developers to work for you, great!
      - But if they leave, how are you going to maintain Rust-based solutions in your organization?