Architecting and Building a K8s-based AI Platform

Mario-Leander Reimer, qaware.de
@LeanderReimer @qaware #CloudNativeNerd #gerneperdude

Mario-Leander Reimer, Managing Director | CTO
@LeanderReimer #cloudnativenerd #qaware #gernperDude

Platform engineering is the discipline of designing and building toolchains and workflows that enable self-service capabilities for software engineering organizations in the cloud-native era. Platform engineers provide an integrated product most often referred to as an “Internal Developer Platform” covering the operational necessities of the entire lifecycle of an application.
https://platformengineering.org/blog/what-is-platform-engineering

A platform consists of different conceptual components, depending on the stakeholders and their use cases.
▪ Developer Control Plane: IDE, Service Catalog / API Catalog, Developer Portal, Application Source Code, Infrastructure & Platform Source Code
▪ Integration and Delivery Plane: CI Pipeline, Registry, CD Pipeline, Platform Orchestrator, GitOps
▪ Monitoring and Logging Plane: Observability
▪ Security Plane: Secrets & Identity Manager, Certificates & Encryption
▪ Resource Plane: Compute, Data, Integration, Networking
https://humanitec.com/reference-architectures

Why do we need an AI platform?

"According to Gartner, 80% of PoCs fail on their way into productive use."
https://www.qaware.de/ki-vom-proof-of-concept-poc-zur-entwicklung/

The 80% Fallacy of AI projects.
Juan Pablo Bottaro, LinkedIn Engineering Blog

Key challenges: technology, models and tools, scaling.
Source: https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-in-2023-generative-ais-breakout-year
▪ Different challenges are seen depending on the maturity of the group
▪ AI newcomers often underestimate the complexity of technologies, models and tools
▪ Production and scaling challenges often hinder production readiness
▪ High cognitive load and lack of expertise are also drivers for failing projects

Chatbots and AI assistants: The more specific the use case, the more complex it becomes.
Axes: Complexity, Benefit
▪ ChatGPT or comparable with world knowledge
▪ ChatGPT with organisational context knowledge
▪ Specialized AI Assistant
Approaches and notes:
▪ Retrieval-Augmented Generation
▪ Transfer Learning
▪ Specially trained model
▪ Hyper Automation
▪ Easy to realise and relatively cost-efficient
▪ Requires data protection and compliance guidelines

Our proposal for an AI Platform Architecture

Planes:
▪ Integration & Delivery Plane
▪ Service Plane
▪ Quality Plane
▪ Data Plane
▪ Platform Plane
▪ Resource Plane
▪ User Serving Plane
▪ Access Plane / APIs
▪ Orchestration Plane
▪ Data Modelling Plane
▪ Model Plane
▪ Compliance Plane
Components shown within the planes: Observability, Operability, Compute, Data, Integration, Security, Delivery, FinOps

lreimer/k8s-native-ai-platform lreimer/k3s-ai-platform

The Kubernetes cluster topology requires precise planning. Otherwise the costs will go through the roof!
▪ There are different GPU machines
▪ Not all types are available in all regions
▪ Prices vary drastically, accurate research is recommended
▪ Additional local SSDs are recommended
▪ To be decided:
– all nodes with GPU
– different nodes optimised for normal as well as GPU workloads
https://cloud.google.com/compute/gpus-pricing?hl=de#other-gpu-models
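
The topology decision above (all nodes with GPU versus separate node pools for normal and GPU workloads) can be roughed out with a small cost comparison. The following is a minimal sketch, not part of the original deck: the `NodePool` class, the node counts and both hourly prices are hypothetical placeholders and are not taken from any provider's price list. Real figures would have to come from the pricing page for the chosen region and machine type.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # common approximation for the hours in one month


@dataclass(frozen=True)
class NodePool:
    """A group of identical cluster nodes (hypothetical helper type)."""
    name: str
    node_count: int
    hourly_price: float  # price per node and hour, placeholder values only

    def monthly_cost(self) -> float:
        # Cost of running every node in the pool for a full month.
        return self.node_count * self.hourly_price * HOURS_PER_MONTH


def topology_cost(pools: list[NodePool]) -> float:
    """Sum the monthly cost of all node pools in one cluster topology."""
    return sum(pool.monthly_cost() for pool in pools)


# ASSUMPTION: illustrative prices only, not real cloud prices.
GPU_NODE_PRICE = 2.50
CPU_NODE_PRICE = 0.25

# Option 1: every node carries a GPU.
all_gpu = [NodePool("gpu", node_count=6, hourly_price=GPU_NODE_PRICE)]

# Option 2: normal workloads on CPU nodes, GPU nodes only where needed.
mixed = [
    NodePool("cpu", node_count=4, hourly_price=CPU_NODE_PRICE),
    NodePool("gpu", node_count=2, hourly_price=GPU_NODE_PRICE),
]

if __name__ == "__main__":
    for label, pools in (("all GPU", all_gpu), ("mixed", mixed)):
        print(f"{label}: {topology_cost(pools):.2f} per month")
```

With these placeholder numbers the mixed topology is cheaper, but the result depends entirely on the prices and on how many nodes actually need a GPU.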

Planes:
▪ Compliance Plane
▪ Integration & Delivery Plane
▪ Service Plane
▪ Platform Plane
▪ Resource Plane
▪ Quality Plane
▪ Data Plane
▪ Model Plane
▪ User Serving Plane
▪ Access Plane
▪ Data Modelling Plane
Components shown within the planes: Operability, Compute, Data: Local SSD, Integration, Security, Delivery, FinOps

QAware GmbH | Aschauer Straße 30 | 81549 München |