Automating AI Infrastructure on GCP with Infrastructure as Code

Ananda Dwi Ae

March 23, 2025

Transcript

  1. “If you do what you’ve always done, you’ll get what you’ve always gotten.” Tony Robbins
  2. MLOps: quick recap

    [Diagram of a production AI/ML platform. Labels: Business Use Case, Data Lake, Data and Feature Management, ML Development, Training Operationalization, Continuous Training, Model Deployment, Prediction Serving, Continuous Monitoring, ML Metadata Management, Production Applications & Processes.]
  3. GCP AI Infrastructure

    • AI accelerators for every use case, from high-performance training to low-cost inference
    • Scale faster with GPUs and TPUs on Google Kubernetes Engine or Google Compute Engine
    • Deployable solutions for Vertex AI, Google Kubernetes Engine, and the Cloud HPC Toolkit
  4. Challenges in Deploying and Managing AI Infrastructure

    • Human Error
    • Inconsistent Environments
    • Scalability Issues
    • Limited Auditability & Documentation
    • High Operational Overhead
  5. What is Infrastructure as Code (IaC)?

    Infrastructure as Code (IaC) is a practice where infrastructure (servers, networks, databases, and other resources) is provisioned, managed, and configured using code instead of manual processes. It allows developers and operators to define and automate infrastructure deployment, ensuring consistency, scalability, and repeatability.

    Benefits?
  6. How IaC works

    ◦ Define: Write code (usually in languages like HCL, YAML, or JSON) that describes the desired infrastructure.
    ◦ Provision: Use an IaC tool (e.g., Terraform, Pulumi) to create or modify infrastructure.
    ◦ Test: Validate the infrastructure before deploying to production.
    ◦ Deploy: Push changes through pipelines or directly apply them.
    ◦ Manage: Monitor and maintain infrastructure using automated updates or scaling.
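    The steps above can be sketched with a minimal Terraform configuration. This is an illustration only, not taken from the deck: the project ID, bucket name, and provider version constraint are placeholder assumptions.

    ```hcl
    # main.tf: the "Define" step. All names below are hypothetical placeholders.
    terraform {
      required_providers {
        google = {
          source  = "hashicorp/google"
          version = ">= 5.0" # assumed constraint; pin to whatever you have tested
        }
      }
    }

    provider "google" {
      project = "my-ml-project" # hypothetical project ID
      region  = "us-central1"
    }

    # Desired state: one bucket for training data.
    resource "google_storage_bucket" "training_data" {
      name                        = "my-ml-project-training-data" # must be globally unique
      location                    = "US"
      uniform_bucket_level_access = true
    }

    # A typical CLI workflow for the remaining steps:
    #   terraform init      # download the google provider
    #   terraform validate  # check that the configuration is well formed
    #   terraform plan      # preview what would be created or changed
    #   terraform apply     # create or update the real resources
    ```

    Running `terraform plan` again after `apply` should report no changes, which is the repeatability that the earlier slides describe.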
  7. Terraform for GCP Automation

    | Service                  | Purpose                                  | Terraform Resource             |
    |--------------------------|------------------------------------------|--------------------------------|
    | Compute Engine           | GPU/CPU instances for ML model training. | google_compute_instance        |
    | Vertex AI                | Managed ML model training and deployment.| google_vertex_ai_endpoint      |
    | Cloud Storage            | Storing datasets and model artifacts.    | google_storage_bucket          |
    | Cloud Functions          | Event-driven serverless computing.       | google_cloudfunctions_function |
    | BigQuery                 | Data warehouse for analytics.            | google_bigquery_dataset        |
    | Pub/Sub                  | Messaging for pipeline automation.       | google_pubsub_topic            |
    | Cloud Composer (Airflow) | Workflow orchestration for pipelines.    | google_composer_environment    |
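    As a rough sketch of how a few of these resources look in practice, the block below declares a GPU training VM, a Pub/Sub topic, and a BigQuery dataset. It assumes the google provider is already configured with a project; the resource names, machine type, zone, accelerator type, and image are illustrative choices, not values from the deck.

    ```hcl
    # GPU instance for model training (names and sizes are hypothetical).
    resource "google_compute_instance" "trainer" {
      name         = "ml-trainer"
      machine_type = "n1-standard-8"
      zone         = "us-central1-a"

      boot_disk {
        initialize_params {
          image = "debian-cloud/debian-12" # plain OS image; GPU drivers are installed separately
        }
      }

      guest_accelerator {
        type  = "nvidia-tesla-t4"
        count = 1
      }

      # Instances with attached GPUs cannot live-migrate, so they must
      # terminate on host maintenance.
      scheduling {
        on_host_maintenance = "TERMINATE"
      }

      network_interface {
        network = "default"
      }
    }

    # Topic that pipeline stages can publish events to.
    resource "google_pubsub_topic" "pipeline_events" {
      name = "pipeline-events"
    }

    # Dataset for analytics tables.
    resource "google_bigquery_dataset" "analytics" {
      dataset_id = "ml_analytics"
      location   = "US"
    }
    ```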
  8. Hands-On Lab

    ◦ Deploy a private Vertex AI Workbench instance without a proxy. The instance is accessed through IAP (Identity-Aware Proxy).
    ◦ The example creates:
      ▪ Vertex AI Workbench with a custom service account: main.tf
      ▪ Network and firewall rules: network.
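    The lab's own files are not reproduced in this transcript. The following is a hedged sketch of what such a setup might look like, not the author's configuration. Every name, CIDR range, and machine type is an assumption, and the `google_workbench_instance` arguments shown should be checked against the google provider documentation for the version you use.

    ```hcl
    variable "project_id" {
      type = string
    }

    # Custom service account for the Workbench instance.
    resource "google_service_account" "workbench" {
      project      = var.project_id
      account_id   = "workbench-sa"
      display_name = "Vertex AI Workbench service account"
    }

    # Dedicated VPC and subnet with Private Google Access, since the
    # instance has no external IP.
    resource "google_compute_network" "vpc" {
      project                 = var.project_id
      name                    = "workbench-vpc"
      auto_create_subnetworks = false
    }

    resource "google_compute_subnetwork" "subnet" {
      project                  = var.project_id
      name                     = "workbench-subnet"
      region                   = "us-central1"
      network                  = google_compute_network.vpc.id
      ip_cidr_range            = "10.10.0.0/24"
      private_ip_google_access = true
    }

    # Allow SSH only from Google's IAP TCP forwarding range.
    resource "google_compute_firewall" "allow_iap_ssh" {
      project       = var.project_id
      name          = "allow-iap-ssh"
      network       = google_compute_network.vpc.id
      direction     = "INGRESS"
      source_ranges = ["35.235.240.0/20"]

      allow {
        protocol = "tcp"
        ports    = ["22"]
      }
    }

    # Private Workbench instance: no public IP, no notebook proxy.
    resource "google_workbench_instance" "notebook" {
      project              = var.project_id
      name                 = "private-workbench"
      location             = "us-central1-a"
      disable_proxy_access = true

      gce_setup {
        machine_type      = "e2-standard-4"
        disable_public_ip = true

        service_accounts {
          email = google_service_account.workbench.email
        }

        network_interfaces {
          network = google_compute_network.vpc.id
          subnet  = google_compute_subnetwork.subnet.id
        }
      }
    }
    ```

    With a layout like this, one way to reach JupyterLab is an SSH session tunneled through IAP with a local port forward, which is why only port 22 is opened to the IAP range.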
  9. Security Best Practices

    ◦ IAM and access management: apply the principle of least privilege
    ◦ Rotate and secure keys
    ◦ Encrypt data at rest and in transit
    ◦ Use private networking
    ◦ Secure sensitive variables
    ◦ Implement Role-Based Access Control (RBAC)
    ◦ Enable audit logging
    ◦ Store Terraform state securely
    ◦ Implement policies and guardrails
    ◦ Regularly patch and update
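    A few of these practices map directly onto Terraform configuration. The sketch below shows remote state in a Cloud Storage bucket, a variable marked sensitive, and a narrowly scoped IAM grant. The bucket, project, variable, and service account names are hypothetical, and the state bucket is assumed to exist already with restricted access.

    ```hcl
    # Keep state out of local files and version control.
    terraform {
      backend "gcs" {
        bucket = "my-ml-project-tfstate" # hypothetical, pre-existing bucket
        prefix = "ai-infra/prod"
      }
    }

    # Marked sensitive so the value is redacted in plan and apply output.
    # It is still written to state, which is one reason the state bucket
    # itself needs tight access control.
    variable "api_token" {
      type      = string
      sensitive = true
    }

    # Least privilege: read-only access to objects rather than a broad role.
    resource "google_project_iam_member" "trainer_storage_read" {
      project = "my-ml-project"
      role    = "roles/storage.objectViewer"
      member  = "serviceAccount:trainer@my-ml-project.iam.gserviceaccount.com"
    }
    ```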
  10. Challenges and How to Overcome Them

    ◦ Handling State Management
    ◦ Managing Dependencies
    ◦ Change Management in Production
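    The deck lists these challenges without detailing remedies. As one possible illustration, and not the author's recommendation, Terraform offers explicit dependencies and lifecycle rules that address the second and third points. Resource names and the project ID below are placeholders.

    ```hcl
    # The Vertex AI API must be enabled before any endpoint can be created.
    resource "google_project_service" "aiplatform" {
      project = "my-ml-project"
      service = "aiplatform.googleapis.com"
    }

    resource "google_vertex_ai_endpoint" "serving" {
      name         = "serving-endpoint"
      display_name = "serving-endpoint"
      location     = "us-central1"

      # Terraform cannot infer this ordering from attribute references,
      # so the dependency is declared explicitly.
      depends_on = [google_project_service.aiplatform]
    }

    resource "google_storage_bucket" "model_artifacts" {
      name     = "my-ml-project-model-artifacts"
      location = "US"

      # Guardrail for production: any plan that would destroy this
      # bucket fails instead of proceeding.
      lifecycle {
        prevent_destroy = true
      }
    }

    # For change management, a reviewed plan can be saved and applied as is:
    #   terraform plan -out=tfplan
    #   terraform apply tfplan
    ```

    For state management, a remote backend that supports locking, such as the `gcs` backend, helps prevent two people from applying conflicting changes at the same time.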