SDD 2026: AI Goes Local - Why the Future of Intelligent Software Runs On-Device

Generative AI has transformed how we think about building software – but the next major shift is already underway: intelligence is moving out of the cloud and onto our own devices. Across industries such as healthcare, manufacturing, automotive, finance, energy and the public sector, organisations are discovering that cloud-dependent AI cannot meet critical requirements around privacy, latency, reliability, regulation or cost. At the same time, the economics and physics of computation are shifting: local inference reduces operational cost, avoids network round-trips, is dramatically more energy-efficient, and aligns with the natural principle of data gravity – processing data where it is created instead of continuously shipping it elsewhere.

After years shaped by cloud-centric AI from OpenAI, Microsoft, Google and Amazon, the industry is now shifting toward on-device intelligence – powered by hardware from Apple, Qualcomm, Intel, AMD and NVIDIA, and by the corresponding local inference runtimes. Meanwhile, modern Small Language Models, Vision-Language Models, multimodal systems and specialised AI agents have become efficient enough to run locally on servers, desktops, laptops, phones, browsers and even edge hardware – enabled by a new hardware renaissance of GPUs, NPUs, unified memory architectures and optimised runtimes. Local AI is steadily becoming the technical baseline for intelligent, domain-specific applications.

This keynote explores why this shift is happening now – and what it means for developers and architects. Christian will show how local AI delivers fast response times, offline resilience and true data sovereignty; how hybrid local–cloud architectures are evolving to combine on-device intelligence with cloud-scale capabilities; and how lightweight fine-tuning and model adaptation techniques enable teams to specialise models for their own domains, workflows and compliance needs – often directly on their own hardware. He also highlights how Local AI brings back model ownership and lifecycle control, allowing teams to treat models as part of their core engineering assets rather than external APIs. The result is AI that finally fits the real-world constraints of vertical industries instead of forcing them to adapt to cloud limitations.

With practical examples, architectural clarity and a forward-looking perspective, Christian presents a grounded vision of the emerging Post-Cloud era of AI – one where intelligence runs where data is created, where systems remain robust even offline, where regulatory demands are met by design, where cost and energy consumption become sustainable, and where developers regain the power to build truly intelligent and sovereign software systems.

Christian Weyer

May 12, 2026

Transcript

  1. Five-Model Local AI Agent: Privacy-preserving agentic AI. Five specialized SLMs. All inference local.

    User: Asks the questions.
    Admin / Trainer: Fine-tunes, evals, curates knowledge.
    SLM Inference Servers: llama-server processes. OpenAI-compatible APIs. Base and fine-tuned, side by side.
    Speech-to-Text: whisper.cpp. Metal / CUDA. Optional (voice pipeline).
    Text-to-Speech: Piper TTS, ONNX / CPU. EN + DE voices. Optional (voice pipeline).
    Cloud LLM (GPT-5.4): User-approved escalation only. Data leaves the machine on consent.
    Connections: Queries via Observatory, CLI, iPhone, or REST · Fine-tuning · evals · knowledge base · OpenAI-compatible API · Transcription (optional) · Synthesis (optional) · HITL escalation, user-approved [HTTPS]
    Legend: person · system · external person · external system
  2. # Real components: the deterministic 35%

    intent_classifier_logreg.LogRegIntentClassifier   # embed + LogReg, ~10 ms total
    query_decomposer.MULTI_STEP_PATTERNS              # 16 regex: one step or two?
    query_decomposer.decompose                        # JSON parse fallback (line 130)
    intent_classifier.looks_like_injection            # 30 regex: prompt-injection guard
    sqlite3.execute                                   # 1 ms: no model beats stdlib
  3. embeddings.search(query)

    tools=[sql_query, calc]
    plan → step → step → answer
    sql_builder.py · regex
    prompts/intent.md
    scenarios/nextera.json
  4. Base Model Servers (mode: base): gemma3-1b-it :9090 · qwen3.5-4b :9091 · embeddinggemma-308m :9092

    Fine-Tuned Servers (mode: finetuned): gemma3-ft :9094 · qwen3.5-4b-ft :9095 · embeddinggemma-ft :9096
    Shared (no FT variant): gemma3-4b-vision :9093. Vision model shared across modes.
    GLM-OCR :9098: Upload-time only. Not part of swap.
    SmallLanguageModelClient: POST /models/swap, swap_urls(). ~100ms swap time. No restart. Health-check before flip.
    curl -X POST :8000/models/swap -d '{"mode":"ft"}'
  5. // scenarios/nextera.json (excerpt)

    {
      "name": "nextera",
      "language": "en",
      "paths": { "db": "./data/business.db", … },
      "models": { "inference_ft": "gemma3-ft-nextera.gguf", … },
      "sql": { "allowed_tables": ["products", "customers", "sales", "competit
    }

    // scenarios/logistics.json (excerpt)
    {
      "name": "logistics",
      "language": "de",
      "paths": { "db": "./data/logistics.db", … },
      "models": { "inference_ft": "gemma3-ft-logistics.gguf", … },
      "sql": { "allowed_tables": ["einheiten", "ausrüstung", "ersatzteile", "
    }
  6. Local Path: POST /query → agent.process(query) → Full local pipeline → AgentResponse → ConfidenceRouter (8-factor heuristic scoring)

    score >= 0.6?
      yes: Return response (confidence: 0.85)
      no: Return response + should_escalate: true + confidence: 0.42. Observatory UI shows escalation banner.
    Cloud Escalation (HITL): User clicks "Escalate to GPT-5.4"?
      no: Keep local response
      yes: POST /escalate
    network online + key configured?
      no: Blocked 403 / 503
      yes: GPT-5.4 API call. Data leaves machine. cloud_bytes_sent += payload. Return cloud response + model badge + latency_ms + cost.
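
The routing on slide 6 can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the talk's actual implementation. The 0.6 threshold, the should_escalate flag, the 403 / 503 blocks and the cloud_bytes_sent counter come from the slide. The names route, EscalationGate and request are hypothetical, as is the mapping of 403 to a missing API key and 503 to an offline network. The 8-factor score is taken as an input because the slide does not list the factors.

```python
from dataclasses import dataclass

ESCALATION_THRESHOLD = 0.6  # slide 6: "score >= 0.6?"


@dataclass
class AgentResponse:
    text: str
    confidence: float
    should_escalate: bool = False


def route(text: str, confidence: float) -> AgentResponse:
    """Return the local answer, flagged for escalation if the score is below the threshold.

    `confidence` stands in for the output of the 8-factor heuristic,
    which is not specified on the slide.
    """
    return AgentResponse(
        text=text,
        confidence=confidence,
        should_escalate=confidence < ESCALATION_THRESHOLD,
    )


@dataclass
class EscalationGate:
    """Hypothetical guard in front of the cloud call; used only after the user approves."""

    network_online: bool
    key_configured: bool
    cloud_bytes_sent: int = 0

    def request(self, payload: bytes) -> int:
        """Return an HTTP-style status code for an escalation request."""
        if not self.key_configured:
            return 403  # assumption: missing key maps to 403
        if not self.network_online:
            return 503  # assumption: offline maps to 503
        # Only here does data leave the machine, so count the bytes sent.
        self.cloud_bytes_sent += len(payload)
        return 200
```

Calling route("answer", 0.42) yields a response with should_escalate set, which a UI could turn into the escalation banner. The gate is consulted only once the user has explicitly chosen to escalate, so the local answer stays the default.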