SDD 2026: Al Goes Local - Why the Future of Intelligent Software Runs On-Device

User Asks the questions Admin / Trainer Fine-tunes, evals, curates
knowledge Five-Model Local AI Agent Privacy-preserving agentic AI. Five specialized SLMs. All inference local. SLM Inference Servers llama-server processes. OpenAI-compatible APIs. Base and ﬁne-tuned, side by side. Speech-to-Text whisper.cpp. Metal / CUDA. Optional — voice pipeline. Text-to-Speech Piper TTS, ONNX / CPU. EN + DE voices. Optional — voice pipeline. Cloud LLM (GPT-5.4) User-approved escalation only. Data leaves the machine on consent. Queries via Observatory, CLI, iPhone, or REST Fine-tuning · evals · knowledge base OpenAI-compatible API Transcription (optional) Synthesis (optional) HITL escalation, user-approved [HTTPS] Legend person system external person external system

# Real components — the deterministic 35% intent_classifier_logreg.LogRegIntentClassifier # embed
+ LogReg, ~10 ms total query_decomposer.MULTI_STEP_PATTERNS # 16 regex — one step or two? query_decomposer.decompose # JSON parse fallback (line 130) intent_classifier.looks_like_injection # 30 regex — prompt-injection guard sqlite3.execute # 1 ms — no model beats stdlib

embeddings.search(query) tools=[sql_query, calc] plan → step → step → answer
sql_builder.py · regex prompts/intent.md scenarios/nextera.json

# finetune/train_qwen35_toolcalling.py peft_config = LoraConfig( r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], lora_dropout=0.05,
task_type="CAUSAL_LM", ) # 1,372 examples · 3 epochs · ~6 min on RTX

Base Model Servers Fine-Tuned Servers gemma3-1b-it :9090 qwen3.5-4b :9091 embeddinggemma-308m
:9092 gemma3-4b-vision :9093 gemma3-ft :9094 qwen3.5-4b-ft :9095 embeddinggemma-ft :9096 SmallLanguageModelClient POST /models/swap ~100ms swap time. No restart. Health-check before flip. Vision model shared across modes. GLM-OCR :9098 Upload-time only. Not part of swap. swap_urls() mode: base mode: base mode: base mode: finetuned mode: finetuned mode: finetuned shared (no FT variant) curl -X POST :8000/models/swap -d '{"mode":"ft"}'

// scenarios/nextera.json (excerpt) { "name": "nextera", "language": "en", "paths": {
"db": "./data/business.db", … }, "models": { "inference_ft": "gemma3-ft-nextera.gguf", … }, "sql": { "allowed_tables": ["products", "customers", "sales", "competit } // scenarios/logistics.json (excerpt) { "name": "logistics", "language": "de", "paths": { "db": "./data/logistics.db", … }, "models": { "inference_ft": "gemma3-ft-logistics.gguf", … }, "sql": { "allowed_tables": ["einheiten", "ausrüstung", "ersatzteile", " }

Local Path POST /query agent.process(query) Full local pipeline AgentResponse ConfidenceRouter
8-factor heuristic scoring score >= 0.6? yes no Return response confidence: 0.85 Return response + should_escalate: true + confidence: 0.42 Observatory UI shows escalation banner Cloud Escalation (HITL) User clicks "Escalate to GPT-5.4"? no yes Keep local response POST /escalate network online + key configured? no yes Blocked 403 / 503 GPT-5.4 API call Data leaves machine cloud_bytes_sent += payload Return cloud response + model badge + latency_ms + cost

bench.py scripts/benchmark.py

llama-server

SDD 2026: Al Goes Local - Why the Future of Int...

SDD 2026: Al Goes Local - Why the Future of Intelligent Software Runs On-Device

Christian Weyer PRO

More Decks by Christian Weyer

Other Decks in Programming

Featured

Transcript

User Asks the questions Admin / Trainer Fine-tunes, evals, curates

# Real components — the deterministic 35% intent_classifier_logreg.LogRegIntentClassifier # embed

embeddings.search(query) tools=[sql_query, calc] plan → step → step → answer

# finetune/train_qwen35_toolcalling.py peft_config = LoraConfig( r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], lora_dropout=0.05,

Base Model Servers Fine-Tuned Servers gemma3-1b-it :9090 qwen3.5-4b :9091 embeddinggemma-308m

// scenarios/nextera.json (excerpt) { "name": "nextera", "language": "en", "paths": {

Local Path POST /query agent.process(query) Full local pipeline AgentResponse ConﬁdenceRouter

bench.py scripts/benchmark.py

llama-server