In some cases, the stages of a machine learning pipeline are distributed across regions or clouds. Data preprocessing, model training, and inference may run in different regions or clouds to leverage specialized resource types or services available in a particular cloud, or to reduce latency by placing inference near user-facing applications. Additionally, because GPUs remain scarce, it is increasingly common to set up training clusters far from where the data resides. This multi-region/multi-cloud scenario breaks data locality, resulting in higher latency and expensive data egress costs.
In this talk, Beinan Wang, Senior Staff Software Engineer at Alluxio, will discuss how Alluxio’s open-source distributed caching system integrates with Ray in multi-region/multi-cloud scenarios:
* The data locality challenges in multi-region/multi-cloud ML pipelines
* How the Ray+PyTorch+Alluxio stack overcomes these challenges, optimizes model training performance, reduces costs, and improves reliability
* The architecture and integration of Ray+PyTorch+Alluxio using POSIX or RESTful APIs (see the sketch after this list)
* ResNet and BERT benchmark results showing performance gains, along with a cost savings analysis
* Real-world examples of how Zhihu, a leading Q&A platform, combined Alluxio’s distributed caching and data management with Ray’s scalable distributed computing to optimize its multi-cloud model training performance
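
To make the POSIX-based integration concrete, below is a minimal sketch of one way the pieces could fit together: a Ray task runs a standard PyTorch training loop that reads its dataset through an Alluxio FUSE mount, so remote data cached by Alluxio appears as a local directory. The mount point, dataset layout, and ResNet-50 hyperparameters here are illustrative assumptions, not details from the talk.

```python
import ray
import torch
import torchvision
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Hypothetical Alluxio FUSE mount point; in practice this is wherever the
# Alluxio POSIX API exposes the cached remote dataset on each Ray worker.
ALLUXIO_FUSE_PATH = "/mnt/alluxio/datasets/imagenet/train"


@ray.remote(num_gpus=1)
def train_one_epoch():
    # Because Alluxio presents cached remote data as a local POSIX path,
    # the standard torchvision ImageFolder/DataLoader code works unchanged.
    dataset = datasets.ImageFolder(
        ALLUXIO_FUSE_PATH,
        transform=transforms.Compose(
            [transforms.RandomResizedCrop(224), transforms.ToTensor()]
        ),
    )
    loader = DataLoader(dataset, batch_size=128, shuffle=True, num_workers=8)

    model = torchvision.models.resnet50(num_classes=1000).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    criterion = torch.nn.CrossEntropyLoss()

    model.train()
    for images, labels in loader:
        images, labels = images.cuda(), labels.cuda()
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    return loss.item()


if __name__ == "__main__":
    ray.init()
    print(ray.get(train_one_epoch.remote()))
```

Because the training code only sees a local filesystem path, moving the workload between regions or clouds becomes a matter of pointing the Alluxio cache at the remote data source rather than rewriting the data loading logic.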