The Hitchhiker's Guide to MLOps by David Cardozo

GDG Montreal
November 14, 2024

This talk is your roadmap to mastering the three pillars of modern MLOps: Docker for containerizing your model's code, Kubeflow for orchestrating pipelines, and Vertex AI for organizing. If you're looking to scale your models in production, this guide provides practical insights to make it happen.

https://youtu.be/mqHyNwUSGmk

DevFest Montreal 2024

Transcript

  1. You’re an Online Retailer Selling Shoes ...

    Your model predicts click-through rates (CTR), helping you decide how much inventory to order.
  2. What causes problems?

    Kinds of problems
    • Fast - Example: bad sensor, bad software update
    • Slow - Example: drift
  3. Sudden Problems

    Problem with data collection
      ◦ Bad sensor/camera
      ◦ Bad log data
      ◦ Moved or disabled sensors/cameras
    Systems problem
      ◦ Bad software update
      ◦ Loss of network connectivity
      ◦ System down
      ◦ Bad credentials
  4. Gradual Problems

    Data changes
      ◦ Trend and seasonality
      ◦ Distribution of features changes
      ◦ Relative importance of features changes
    World changes
      ◦ Styles change
      ◦ Competitors change
      ◦ Business expands to other geos
  5. Why “Understand” the model?

    • Mispredictions do not have uniform cost to your business.
    • The data you have is rarely the data you wish you had.
    • Model objective is nearly always a proxy for your business objectives.
    • Some percentage of your customers may have a bad experience.
    • The real world doesn’t stand still.
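
To make the point about non-uniform misprediction cost concrete for the shoe retailer example, the sketch below scores two forecasts that have the same absolute error but very different business cost. The per-unit costs are made-up assumptions chosen only to show the asymmetry.

```python
from typing import Sequence

# Hypothetical per-unit costs: unsold stock is assumed cheaper than a lost sale.
OVERSTOCK_COST = 2.0   # cost per unit ordered but not sold
STOCKOUT_COST = 10.0   # cost per unit of demand that could not be served


def business_cost(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Sum the asymmetric cost of over- and under-predicting demand."""
    total = 0.0
    for p, a in zip(predicted, actual):
        if p > a:
            total += (p - a) * OVERSTOCK_COST
        else:
            total += (a - p) * STOCKOUT_COST
    return total


actual = [100, 100, 100]
over = [110, 110, 110]   # total absolute error: 30 units
under = [90, 90, 90]     # total absolute error: 30 units

print(business_cost(over, actual))   # 60.0
print(business_cost(under, actual))  # 300.0
```

A metric such as mean absolute error would rate both forecasts equally, which is one reason the model objective is only a proxy for the business objective.
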
  6. Easy Problems

    • Ground truth changes slowly (months, years)
    • Model retraining driven by:
      ◦ Model improvements, better data
      ◦ Changes in software and/or systems
    • Labeling
      ◦ Curated datasets
      ◦ Crowd-based
  7. Harder Problems

    • Ground truth changes faster (weeks)
    • Model retraining driven by:
      ◦ Declining model performance
      ◦ Model improvements, better data
      ◦ Changes in software and/or systems
    • Labeling
      ◦ Direct feedback
      ◦ Crowd-based
  8. Really Hard Problems

    • Ground truth changes very fast (days, hours, minutes)
    • Model retraining driven by:
      ◦ Declining model performance
      ◦ Model improvements, better data
      ◦ Changes in software and/or systems
    • Labeling
      ◦ Direct feedback
      ◦ Weak supervision
  9. … a production solution requires so much more

    • Configuration
    • Data Collection
    • Data Verification
    • Feature Extraction
    • Process Management Tools
    • Analysis Tools
    • Machine Resource Management
    • Serving Infrastructure
    • Monitoring
    • ML Code
  10. Production Machine Learning

    Machine Learning Development
    • Labeled data
    • Feature space coverage
    • Minimal dimensionality
    • Maximum predictive data
    • Fairness
    • Rare conditions
    • Data lifecycle management
    +
    Modern Software Development
    • Scalability
    • Extensibility
    • Configuration
    • Consistency & Reproducibility
    • Modularity
    • Best Practices
    • Testability
    • Monitoring
    • Safety & Security
  11. Leading ML best practices

    • Continuous Training for Production ML in the TFX Platform. OpML (2019).
    • Slice Finder: Automated Data Slicing for Model Validation. ICDE (2019).
    • Data Validation for Machine Learning. SysML (2019).
    • TFX: A TensorFlow-Based Production-Scale Machine Learning Platform. KDD (2017).
    • Data Management Challenges in Production Machine Learning. SIGMOD (2017).
    • Rules of Machine Learning: Best Practices for ML Engineering. Google AI Web (2017).
    • Machine Learning: The High Interest Credit Card of Technical Debt. NeurIPS (2015).
  12. What is MLOps?

    “MLOps is a practice for collaboration and communication between data scientists and operations professionals to help manage production ML lifecycle.”

    “Similar to the DevOps or DataOps approaches, MLOps looks to increase automation and improve the quality of production ML while also focusing on business and regulatory requirements.”

    https://en.wikipedia.org/wiki/MLOps
  13. A historical perspective

    • AlexNet 2012
      ◦ A technique that lets the computer figure out the rules
      ◦ Inherently parallel problem
      ◦ Matrix operations
    • GPUs for 2D convolutions
  14. • AlexNet required 2 of these cards
    • Beginning of the GPGPU
      ◦ CUDA
      ◦ cuDNN
    • The start of our conundrums!
  15. CUDA and cuDNN

    • cuDNN is a GPU-accelerated library of primitives for deep neural networks.
    • Convolution forward and backward
    • Pooling forward and backward
    • Softmax forward and backward
    • Neuron activations forward and backward:
      ◦ Rectified linear (ReLU)
      ◦ Sigmoid
      ◦ Hyperbolic tangent (TANH)
    • Tensor transformation functions
  16. It should be easy, right?

    Just buy a GPU and install CUDA and cuDNN.
  17. Linus and Nvidia, they have their issues.

    “Near the end of his talk, when asked by one of the attendees about NVIDIA's hardware support and lack of open-source driver enablement / documentation, he had a few choice words for the Santa Clara company.”
  18. Back to modern times

    • Let us explore the workflow of generating a machine learning model from zero (DevOps perspective)
  19. • Packages up software binaries and dependencies
    • Isolates software components from each other
    • Container is a standard format
    • Easily portable across environments
    • Allows an ecosystem to develop around its standard
  20. Solving the issue of injecting GPU Devices

    ├─ nvidia-docker2
    │    ├─ docker-ce
    │    ├─ docker-ee
    │    ├─ docker.io (>= 18.06.0)
    │    └─ nvidia-container-runtime
    ├─ nvidia-container-runtime
    │    └─ nvidia-container-toolkit
    ├─ nvidia-container-toolkit
    │    └─ libnvidia-container-tools
    ├─ libnvidia-container-tools
    │    └─ libnvidia-container1
    └─ libnvidia-container1
  21. docker run --gpus all -it --rm tensorflow/tensorflow:2.7.0-gpu \
        python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"

    docker run --gpus all -it --rm tensorflow/tensorflow:2.0.0-gpu \
        python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"

    docker run --gpus all -it --rm --ipc=host \
        --name mypytorchproject pytorch/pytorch:1.4-cuda10.1-cudnn7-devel
  22. Anatomy and Structure of Base Images

    Common guidelines
    • NVIDIA CUDA: base, runtime, devel
      Templates: 11.4.2-cudnn8-runtime-ubuntu20.04
    • TensorFlow: GPU, Jupyter
      Templates: tensorflow/tensorflow:2.7.0-gpu-jupyter
    • PyTorch: CUDA, cuDNN
      Templates: pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime
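
The CUDA tag template above packs several choices into one string: CUDA version, optional cuDNN major version, image flavor (base, runtime, or devel), and optional OS. As a minimal sketch, the hypothetical helper below splits such a tag into its parts so a build script can check what it is pulling. The regular expression is an assumption fitted to the tags shown in this deck, not an official grammar for `nvidia/cuda` tags.

```python
import re
from typing import NamedTuple, Optional


class CudaTag(NamedTuple):
    cuda: str             # CUDA version, e.g. "11.4.2"
    cudnn: Optional[str]  # cuDNN major version, or None if not bundled
    flavor: str           # "base", "runtime", or "devel"
    os: Optional[str]     # e.g. "ubuntu20.04", or None if omitted


# Assumed pattern: <cuda>[-cudnn<N>]-<flavor>[-<os>]
_TAG = re.compile(
    r"^(?P<cuda>\d+(?:\.\d+)*)"
    r"(?:-cudnn(?P<cudnn>\d+))?"
    r"-(?P<flavor>base|runtime|devel)"
    r"(?:-(?P<os>[a-z]+[\d.]*))?$"
)


def parse_cuda_tag(tag: str) -> CudaTag:
    """Split a CUDA image tag into its components, or raise ValueError."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"unrecognised CUDA image tag: {tag!r}")
    return CudaTag(**match.groupdict())


tag = parse_cuda_tag("11.4.2-cudnn8-runtime-ubuntu20.04")
print(tag.cuda, tag.cudnn, tag.flavor, tag.os)  # 11.4.2 8 runtime ubuntu20.04
```

A check like `parse_cuda_tag(tag).flavor == "runtime"` can then guard against accidentally shipping a much larger devel image.
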
  23. Considerations while building GPU Images

    • Pinpoint your dependencies
    • Do multistage builds.
    • Images can grow pretty fast.
        FROM nvidia/cuda:10.2-cudnn7-devel AS builder
    • Define constraints
        NVIDIA_REQUIRE_CUDA "cuda>=11.0 driver>=450"
  24. Considerations while running GPU Containers

    docker run --rm --runtime=nvidia \
        -e NVIDIA_VISIBLE_DEVICES=2,3 \
        -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
        nvidia/cuda nvidia-smi

    docker run --gpus all --rm \
        --ipc=host \
        -v local_dir:container_dir \
        nvcr.io/nvidia/pytorch:xx.xx-py3

    (alternatively, replace --ipc=host with --shm-size)

    Keep in mind to use --gpus, since this lets Docker call nvidia-docker to inject devices and environment variables.
    Either use --ipc=host (so that multiple workers can communicate) or increase the shared memory size.
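
The choice between sharing the host IPC namespace and enlarging shared memory can be captured in a small launcher script. The sketch below only builds the `docker run` argument list and does not start a container. The function name `gpu_run_command` and the `8g` size are hypothetical, while `--gpus`, `--ipc=host`, `--shm-size`, `-v` and `--rm` are standard Docker flags.

```python
import shlex
from typing import List, Optional


def gpu_run_command(
    image: str,
    gpus: str = "all",
    shm_size: Optional[str] = None,
    volume: Optional[str] = None,
) -> List[str]:
    """Build a `docker run` argument list for a GPU container."""
    cmd = ["docker", "run", "--rm", "--gpus", gpus]
    # Workers exchange data through shared memory: either share the host IPC
    # namespace or give the container a larger /dev/shm, not both.
    cmd += ["--shm-size", shm_size] if shm_size else ["--ipc=host"]
    if volume:
        cmd += ["-v", volume]
    return cmd + [image]


# "8g" is an arbitrary example size, not a recommendation from the talk.
cmd = gpu_run_command("nvidia/cuda", shm_size="8g")
print(shlex.join(cmd))  # docker run --rm --gpus all --shm-size 8g nvidia/cuda
```

The resulting list can be passed directly to `subprocess.run(cmd, check=True)` on a host where Docker and the NVIDIA container tooling are installed.
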
  25. Consider the yolov5 Dockerfile

    FROM nvcr.io/nvidia/pytorch:21.10-py3
    RUN apt update && apt install -y zip htop \
        screen libgl1-mesa-glx
    COPY requirements.txt .
    RUN python -m pip install --upgrade pip
    RUN pip uninstall -y nvidia-tensorboard nvidia-tensorboard-plugin-dlprof
    # Quote the version specifier so the shell does not treat ">" as a redirect
    RUN pip install --no-cache -r requirements.txt coremltools \
        onnx gsutil notebook 'wandb>=0.12.2'
    RUN pip install --no-cache -U torch torchvision numpy Pillow
    RUN mkdir -p /usr/src/app
    WORKDIR /usr/src/app
    COPY . /usr/src/app
    ADD https://ultralytics.com/assets/Arial.ttf /root/.config/Ultralytics/
  26. Developing in Containers

    • Most TensorFlow and PyTorch GPU images will try to run your code on the GPU, but they fall back to the CPU if no GPU is present (be careful with custom layers)
    • Newer CUDA images are now also hosted at nvcr.io/nvidia/cuda
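
Because a GPU image can silently fall back to the CPU, it helps to make the device choice explicit at start-up and log it. The following is a minimal sketch for PyTorch. The helper name `pick_device` is hypothetical, and the code assumes, but does not require, that `torch` is installed in the image.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # assumed to be present in a PyTorch image
    except ImportError:
        # Without PyTorch there is nothing to place on a GPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"running on {device}")
```

Logging the result at the start of a training job makes an unintended CPU fallback, for example a container started without `--gpus`, visible right away.
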
  27. Kubernetes

    Greek for “Helmsman”; also the root of the words “governor” and “cybernetic”
    • Manages container clusters
    • Inspired and informed by Google’s experiences and internal systems
    • Supports multiple cloud and bare-metal environments
    • Supports multiple container runtimes
    • 100% Open source, written in Go
    Manage applications, not machines
  28. The 10000 foot view

    [Diagram: users (UI, CLI, API); master(s): apiserver, scheduler, controllers, etcd; nodes: kubelet]
  29. Very High Level Architecture

    [Diagram: Kubeflow Pipelines, Vertex AI, GCS, BigQuery, Dataflow, Google Kubernetes Engine (GKE), TensorFlow, JAX, PyTorch]
  30. Vertex AI Pipelines

    • Add conditional logic and branches to your pipeline
    • Store metadata for every artifact produced by the pipeline
    • Track artifacts, lineage, metrics, and execution across your ML workflow
  31. Custom container components

    from kfp import dsl
    from kfp.dsl import Output, Dataset

    @dsl.container_component
    def create_dataset(
        text: str,
        output_gcs: Output[Dataset],
    ):
        return dsl.ContainerSpec(
            image='alpine',
            command=[
                'sh',
                '-c',
                'mkdir --parents $(dirname "$1") && echo "$0" > "$1"',
            ],
            args=[text, output_gcs.path])
  32. Lightweight Python function-based components

    from kfp import dsl
    from kfp.dsl import Input, Output, Dataset, Model

    @dsl.component(
        base_image='python:3.9',
        packages_to_install=['tensorflow==2.10.0'],
    )
    def train_model(
        dataset: Input[Dataset],
        num_epochs: int,
        model: Output[Model],
    ):
        # Imports live inside the function because it runs in its own container
        from tensorflow import keras
        from tensorflow.keras import layers

        # load and process the Dataset artifact
        with open(dataset.path) as f:
            x, y = ...

        my_model = keras.Sequential(
            [
                layers.Dense(4, activation='relu', name='layer_1'),
                layers.Dense(2, activation='relu', name='layer_2'),
                layers.Dense(1, name='layer_3'),
            ]
        )
        my_model.compile(...)

        # train for num_epochs
        my_model.fit(x, y, epochs=num_epochs)

        # save the Model artifact
        my_model.save(model.path)