Architecting and Building a K8s-based AI Platform

Developing a scalable and production-ready AI platform poses significant challenges for organisations. In addition to a modular and flexible architecture, issues such as infrastructure automation, orchestration, model deployment and lifecycle management must be efficiently addressed. Kubernetes and open source technologies provide a powerful foundation for addressing these challenges.

In this talk, we will design a cloud-native AI platform and show how to build it step by step, both locally and in the public cloud. The focus will be on integrating Kubernetes, open source tools and GitOps to create a highly automated, repeatable and scalable environment for machine learning and AI workloads.

M.-Leander Reimer

March 13, 2025

Transcript

  1. Platform engineering is the discipline of designing and building toolchains and workflows that enable self-service capabilities for software engineering organizations in the cloud-native era. Platform engineers provide an integrated product most often referred to as an “Internal Developer Platform” covering the operational necessities of the entire lifecycle of an application. https://platformengineering.org/blog/what-is-platform-engineering
  2. A platform consists of different conceptual components, depending on the stakeholders and their use cases.
     Planes: Developer Control Plane, Integration and Delivery Plane, Monitoring and Logging Plane, Security Plane, Resource Plane
     Components: IDE, Service Catalog / API Catalog, Developer Portal, Application Source Code, Infrastructure & Platform Source Code, Observability, Secrets & Identity Manager, CI Pipeline, Registry, CD Pipeline, Compute, Data, Integration, Networking, Platform Orchestrator, Certificates & Encryption, GitOps
     https://humanitec.com/reference-architectures
  3. "According to Gartner, 80% of PoCs fail on their way into productive use." https://www.qaware.de/ki-vom-proof-of-concept-poc-zur-entwicklung/
  4. The 80% Fallacy of AI projects. Juan Pablo Bottaro, LinkedIn Engineering Blog
  5. Key challenges: technology, models and tools, scaling. Source: https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-in-2023-generative-ais-breakout-year
     ▪ Different challenges are seen depending on the maturity of the group
     ▪ AI newcomers often underestimate the complexity of technologies, models and tools
     ▪ Production and scaling challenges often hinder production readiness
     ▪ High cognitive load and lack of expertise are also drivers for failing projects
  6. Chatbots and AI assistants: The more specific the use case, the more complex it becomes.
     Stages (axes: complexity, benefit): ChatGPT or comparable with world knowledge; ChatGPT with organisational context knowledge; specialized AI assistant
     ▪ Retrieval-Augmented Generation
     ▪ Transfer Learning
     ▪ Specially trained model
     ▪ Hyper Automation
     ▪ Easy to realise and relatively cost-efficient
     ▪ Requires data protection and compliance guidelines
  7. Integration & Delivery Plane, Service Plane, Quality Plane, Data Plane, Platform Plane, Observability, Operability, Resource Plane, User Serving Plane, Access Plane / APIs, Orchestration Plane, Data Modelling Plane, Model Plane, Compliance Plane, Compute, Data, Integration, Security, Delivery, FinOps
  8. The Kubernetes cluster topology requires precise planning. Otherwise the costs will go through the roof!
     ▪ There are different GPU machines
     ▪ Not all types are available in all regions
     ▪ Prices vary drastically, accurate research is recommended
     ▪ Additional local SSDs are recommended
     ▪ To be decided:
       – all nodes with GPU
       – different nodes optimised for normal as well as GPU workloads
     https://cloud.google.com/compute/gpus-pricing?hl=de#other-gpu-models
  9. Compliance Plane, Integration & Delivery Plane, Service Plane, Platform Plane, Operability, Resource Plane, Compute, Data: Local SSD, Integration, Security, Delivery, FinOps, Quality Plane, Data Plane, Model Plane, User Serving Plane, Access Plane, Data Modelling Plane
  11. Thank you! QAware GmbH | Aschauer Straße 30 | 81549 München | Managing Directors: Dr. Josef Adersberger, Michael Stehnken, Michael Rohleder, Mario-Leander Reimer | Offices in München, Mainz, Rosenheim, Darmstadt | +49 89 232315-0
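
Slide 8 leaves open whether every node gets a GPU or whether GPU and general-purpose workloads run on separate nodes. For the second option, a workload has to opt in to the GPU nodes explicitly. The following is a minimal sketch of what that could look like, not the setup from the talk. It assumes the NVIDIA device plugin exposes GPUs as the `nvidia.com/gpu` resource, that the GPU nodes carry a `nvidia.com/gpu` taint with effect `NoSchedule`, and that they are labelled `pool=gpu`. The label, the pod name and the image are hypothetical.

```python
import json


def gpu_pod_manifest(name, image, gpus=1):
    """Build a Pod manifest (as a plain dict) that targets GPU nodes only."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Hypothetical node label; assumes GPU nodes were labelled pool=gpu.
            "nodeSelector": {"pool": "gpu"},
            # Assumes GPU nodes are tainted so that ordinary pods stay off them.
            "tolerations": [
                {
                    "key": "nvidia.com/gpu",
                    "operator": "Exists",
                    "effect": "NoSchedule",
                }
            ],
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Extended resource name registered by the NVIDIA device plugin.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical pod name and image, for illustration only.
    manifest = gpu_pod_manifest("trainer", "example.org/trainer:latest")
    print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON manifests, the printed output could be piped into `kubectl apply -f -`. Pods without the toleration would then stay on the cheaper general-purpose nodes, which is the cost argument the slide makes.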