OpenShift Commons Gathering Chicago 2023 - Case Study: OpenShift AI at NASA
Carlos Costa (IBM), Hongchao Deng (Anyscale), and Alex Corvin (Red Hat) present at the OpenShift Commons Gathering Co-Located with KubeCon + CloudNativeCon North America 2023.
The foundation model lifecycle spans data preparation through deployment:
- Data preparation: a workflow of steps (e.g. remove hate and profanity, deduplicate, etc.)
- Model creation: distributed training as long-running jobs on massive infrastructure
- Model adaptation: model tuning with custom data sets for downstream tasks
- Deployment: optimization to latency, throughput, and power, across public clouds, on-prem, and the edge
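The data-preparation step above can be sketched in plain Python. This is a minimal illustration of the pattern (deduplication plus content filtering); the blocklist, record format, and function name are invented for illustration and are not from the talk.

```python
# Minimal sketch of a data-preparation workflow step: deduplicate records
# and drop documents matching a blocklist (a stand-in for real hate/profanity
# filters, which in practice use trained classifiers, not keyword lists).

BLOCKLIST = {"badword"}  # illustrative placeholder for a real content filter

def prepare(records):
    seen = set()
    cleaned = []
    for text in records:
        normalized = text.strip().lower()
        if normalized in seen:          # exact-match deduplication
            continue
        if any(term in normalized for term in BLOCKLIST):
            continue                    # content filtering
        seen.add(normalized)
        cleaned.append(text.strip())
    return cleaned

docs = ["Hello world", "hello world", "contains badword here", "Another doc"]
print(prepare(docs))  # ['Hello world', 'Another doc']
```

Real pipelines run many such steps in sequence; each stage takes the previous stage's output, which is why the talk frames data preparation as a workflow rather than a single job.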
Serving, optimization, and AI/ML pipelines: making it easier to scale and orchestrate today's core AI building blocks, and making possible what is next. From today to tomorrow:
- SCALE: larger models (billions of parameters)
- More complex adaptation pipelines
- From narrow AI to broad, more reusable AI
The OpenShift AI stack, with midstream delivery in Open Data Hub (a self-managed, self-deploy platform):
- Training and validation: MCAD (job dispatching, queuing, and packing), InstaScale (cluster scaling), KubeRay, TorchX, Multi-NIC CNI
- Tuning and serving: KServe, Text Generation Inference Server, Caikit (dev APIs, prompt tuning, inference)
- Workflows and domain-specific APIs
CodeFlare on OpenShift AI:
- Simplified user experience with the CodeFlare SDK: an intuitive, easy-to-use Python interface for batch resource requesting, access, and job submission
- Enhanced interactivity, logging, and observability for AI/ML jobs on OpenShift
- Advanced Kubernetes-native resource management: Multi-Cluster App Dispatcher (MCAD) enabling job queueing, meta-scheduling, prioritization, and quota management; InstaScale providing on-demand cluster scaling
- Integrated support for TorchX and KubeRay
- Scalable, efficient pre-processing, training, and validation: scale-out, distributed GPU-based training and fine-tuning with PyTorch and Ray
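The MCAD behavior described above (job queueing, prioritization, quota management) can be illustrated with a small pure-Python model. All names and fields here are invented for illustration; this is a conceptual sketch, not MCAD's actual API or dispatch algorithm.

```python
import heapq
from dataclasses import dataclass, field

# Toy model of MCAD-style dispatching: jobs wait in a priority queue and are
# dispatched only when the cluster quota can hold the whole job, so a large
# job is never started with partial resources. Illustrative only.

@dataclass(order=True)
class Job:
    priority: int                       # lower value = dispatched first
    name: str = field(compare=False)
    gpus: int = field(compare=False)

class Dispatcher:
    def __init__(self, gpu_quota):
        self.gpu_quota = gpu_quota
        self.queue = []

    def submit(self, job):
        heapq.heappush(self.queue, job)

    def dispatch(self):
        """Dispatch queued jobs in priority order while quota remains."""
        running, held = [], []
        while self.queue:
            job = heapq.heappop(self.queue)
            if job.gpus <= self.gpu_quota:
                self.gpu_quota -= job.gpus
                running.append(job.name)
            else:
                held.append(job)        # not enough quota yet: stays queued
        self.queue = held
        heapq.heapify(self.queue)
        return running

d = Dispatcher(gpu_quota=8)
d.submit(Job(priority=2, name="tune", gpus=4))
d.submit(Job(priority=1, name="pretrain", gpus=6))
print(d.dispatch())  # ['pretrain'] — 'tune' stays queued (only 2 GPUs left)
```

The quota check before dispatch is the key idea: queueing at the job level avoids the partial-scheduling deadlocks that arise when pods of a distributed job are admitted one at a time.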
Tuning and serving focus areas:
- User experience and performance: performant and efficient inference; SDK and developer experience
- Tuning: vanilla prompt tuning and multi-prompt tuning
- Model scaling, GPU sharing, and model placement
- Coming: LoRA and emerging variants; output filtering (HAP, PII); model chaining/composition
watsonx.ai builds on the hybrid cloud platform:
- Models: suite of IBM-trained foundation models
- Studio: tune and infer; model serving; train and validate (pre-processing, model training, validation)
- Hybrid cloud platform (OpenShift AI): InstaScale (cluster scaling), MCAD (job dispatching, queuing, and packing), KubeRay, and TorchX for training; KServe, TGIS (optimized Text Generation Inference Server), and Caikit Runtime (dev APIs, prompt tuning, inference) for serving
KubeRay separates concerns, allowing data/ML scientists to focus on computation while infra engineers concentrate on Kubernetes:
- Data/ML scientists: develop Python scripts and submit scaling requests for tasks/actors.
- Kubernetes infra engineers: integrate KubeRay with Kubernetes ecosystem tools, e.g. Prometheus, Grafana, and Nginx.
- The KubeRay operator reconciles the cluster: it reads the desired state, creates and deletes pods, performs health checks and monitoring, and updates status for observability.
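The operator's reconcile behavior sketched above (compare desired worker count to live pods, then create or delete pods to converge) can be shown abstractly. This is a conceptual sketch in plain Python with invented names; the real KubeRay controller is written in Go against the Kubernetes API.

```python
# Abstract sketch of a KubeRay-style reconcile step: given the desired
# worker count and the list of live worker pods, compute which pods to
# create or delete so the cluster converges to the desired state.

def reconcile(desired_workers, live_pods):
    """Return (pods_to_create, pods_to_delete) to reach the desired state."""
    diff = desired_workers - len(live_pods)
    if diff > 0:
        # Scale up: name new pods after the existing ones.
        return [f"worker-{len(live_pods) + i}" for i in range(diff)], []
    # Scale down (or no-op): delete the extra pods beyond the desired count.
    return [], live_pods[desired_workers:]

# A scaling request raises desired workers from 2 to 4:
print(reconcile(4, ["worker-0", "worker-1"]))
# (['worker-2', 'worker-3'], [])

# Scale down from 3 workers to 1:
print(reconcile(1, ["worker-0", "worker-1", "worker-2"]))
# ([], ['worker-1', 'worker-2'])
```

Because the function is a pure mapping from observed state to actions, the controller can re-run it on every change event; this level-triggered loop is what makes the division of labor between scientists and infra engineers safe.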
Open source collaboration with Ray. Current focus:
- Contributed workflow DAG generation (Ray Workflows) under ray-project
- Evolving the KubeRay API server
- Integration with Kubernetes-native job schedulers: MCAD integration with KubeRay
- CodeFlare SDK and the CodeFlare Project building on KubeRay
Use cases in new domains (model training, quantum chip design) are leading to contributions across the stack and increasing mind share.
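The workflow DAG generation mentioned above boils down to ordering tasks by their dependencies. A tiny illustration using the standard library's topological sorter follows; the task names are invented and this is not the Ray Workflows API.

```python
from graphlib import TopologicalSorter  # Python 3.9+ standard library

# A workflow DAG: each task maps to the set of tasks it depends on.
# Task names are illustrative placeholders for real pipeline stages.
dag = {
    "preprocess": set(),
    "train": {"preprocess"},
    "validate": {"train"},
    "deploy": {"validate"},
}

# A topological ordering is a valid sequential execution order; a workflow
# engine like Ray Workflows can additionally run independent tasks in parallel.
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['preprocess', 'train', 'validate', 'deploy']
```

Generating the DAG from user code (rather than writing it by hand, as here) is exactly what the contributed workflow DAG generation provides.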
Foundation Models @ NASA: NASA and IBM have teamed up to create an AI foundation model for Earth observations, using large-scale satellite and remote sensing data, including the Harmonized Landsat and Sentinel-2 (HLS) data. OpenShift AI supports the full life cycle of foundation model training, on-prem and beyond: pre-train, fine-tune, inference.
Pre-processing, training, and fine-tuning across clouds: pre-training produces the geospatial model, fine-tuning with OpenShift AI produces the fine-tuned model, and inference serves it, running across IBM Cloud (Vela) and AWS. The pipeline is built on InstaScale (cluster scaling), MCAD (job dispatching, queuing, and packing), KubeRay, and TorchX.
- KubeRay MCAD integration and hardened OpenShift support
- Advanced job and configuration templates for foundation model jobs
- Automated deployment, job launching, and enhanced observability
- Advanced fault recovery and requeuing