As more companies use large-scale machine learning (ML) models for training and evaluation, offline batch inference is becoming an essential workload. It brings a number of challenges: managing compute infrastructure, making efficient use of heterogeneous resources, and moving data from storage to hardware accelerators. Ray addresses these challenges by coordinating clusters of diverse resources, so each stage of the workload runs on the hardware it actually needs, yielding significantly better utilization.
In this talk, we will:
* Discuss the challenges and limitations of offline batch inference
* Examine three solutions for offline batch inference: AWS SageMaker Batch Transform, Apache Spark, and Ray Data (sketched below)
* Share our performance numbers showing Ray Data as the best-performing of the three for offline batch inference at scale
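For a concrete picture of the pattern we compare against, here is a minimal sketch of offline batch inference with Ray Data. The S3 paths, column names, and the toy model are placeholders for illustration, and the `map_batches` arguments assume a recent Ray 2.x release:

```python
import numpy as np
import ray

# Read input data as a Ray Dataset; reads are parallelized across the cluster.
# The S3 path is a placeholder for your own input location.
ds = ray.data.read_parquet("s3://example-bucket/inputs/")


class Predictor:
    """Stateful worker: loads the model once per replica, not once per batch."""

    def __init__(self):
        # Placeholder for loading a real model onto the accelerator.
        self.model = lambda x: x.sum(axis=1)

    def __call__(self, batch: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
        # "features" is a placeholder column name; batches arrive as
        # dicts of NumPy arrays by default.
        batch["prediction"] = self.model(batch["features"])
        return batch


# Ray schedules the model replicas on GPU nodes while CPU nodes keep feeding
# them data, overlapping I/O, preprocessing, and inference.
predictions = ds.map_batches(
    Predictor,
    concurrency=4,   # four model replicas (an actor pool)
    num_gpus=1,      # one GPU per replica
    batch_size=256,  # rows per inference batch
)
predictions.write_parquet("s3://example-bucket/outputs/")
```

Because the class-based UDF runs in an actor pool, the model stays loaded in GPU memory across batches, while reading, preprocessing, and writing run concurrently on the cluster's CPU resources; this overlap is the source of the utilization gains discussed above.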