Christian Liebel
May 14, 2025
„Hallo, KI!?“ – Realtime-Interaktionen mit Language Models ("Hello, AI!?": Realtime Interactions with Language Models)

Large language models (LLMs) have changed the way we design software: instead of clicks and GUIs, natural language suddenly dominates. And not only via keyboard and text, but also by voice. In this webinar, you will learn how to integrate speech-capable AI models directly into your application and control them with your voice in real time.
Thanks to the Realtime API, WebRTC, and tool calling, new possibilities are emerging for natural-language interfaces: minimal latency, bidirectional, multilingual.

Christian Liebel of Thinktecture shows in hands-on demos how to address selected LLMs by voice, connect your own functionality to them, and build smart, conversational interfaces. This is not science fiction.

Warning: interactive voice AIs can be addictive.

Transcript

  1. „Hallo, KI!?“ Realtime-Interaktionen mit Language Models
    Hello, it’s me.
    Christian Liebel
    X: @christianliebel
    Bluesky: @christianliebel.com
    Email: christian.liebel@thinktecture.com
    Angular, PWA & Generative AI
    Slides: thinktecture.com/christian-liebel
  2. Generative AI: Overview
    – Text: OpenAI GPT, Mistral, …
    – Audio/Music: Musico, Soundraw, …
    – Images: DALL·E, Firefly, …
    – Video: Sora, Runway, …
    – Speech: Whisper, tortoise-tts, …
  4. Realtime Models: Overview
    – Process speech input and output natively (transcription optional)
    – Bidirectional speech conversations with minimum latency
    – Multiple languages and output voices are supported
    – Tool/function calling is supported
    – Voice Activity Detection (VAD) activated automatically (model waits for a period of silence before responding)
    – Model can be interrupted
  5. Realtime Models: Use Cases
    – Natural language interfaces
    – Smart Form Filling
    – Navigation
    – Voice assistants
    – Phone agents
    – Alternative input methods for accessibility (ticket machines)
  6. Realtime Models: Models
    OpenAI Realtime API (Beta)
    – GPT-4o Realtime
    – GPT-4o mini Realtime
    Gemini Live API (Preview)
    – Gemini 2.0 Flash Live
  7. Realtime Models: APIs
    OpenAI Realtime API (Beta)
    – 57+ languages
    – Supports speech and text input
    – Supports speech and text output
    – Supports WebRTC and WebSockets
    – No JS SDK yet, WebRTC integration is ~50 LOC
    Gemini Live API (Preview)
    – 40+ languages
    – Supports speech, text and video input
    – Supports speech and text output
    – Supports WebSockets
    – No JS SDK yet, integration is ~1300 LOC
  8. WebRTC: Web Real-Time Communication
    – JavaScript API for real-time audio/video communication
    – Supports data channels for data transfer
    – Used by Google Meet, Microsoft Teams (web), …
    – W3C Recommendation (web standard)
    – Supported by all major browsers for several years (Chrome 27, Edge 15, Safari 11, Firefox 22)
    https://webrtc.org/
  9. Media Capture & Streams API: getUserMedia()
    – JavaScript APIs for accessing media devices
    – Captures video and/or audio input
    – W3C Candidate Recommendation
    – Supported by all major browsers for several years (Chrome 21, Edge 12, Safari 11, Firefox 17)
  10. OpenAI Realtime API (Beta): Code Example (1/3)
    // Create a peer connection
    const pc = new RTCPeerConnection();

    // Set up to play remote audio from the model
    const audioEl = document.createElement("audio");
    audioEl.autoplay = true;
    pc.ontrack = e => audioEl.srcObject = e.streams[0];

    // Add local audio track for microphone input in the browser
    const ms = await navigator.mediaDevices.getUserMedia({ audio: true });
    pc.addTrack(ms.getTracks()[0]);
  11. OpenAI Realtime API (Beta): Code Example (2/3)
    // Set up data channel for sending and receiving events
    const dc = pc.createDataChannel("oai-events");
    dc.addEventListener("message", (e) => {
      // Realtime server events appear here!
      console.log(e);
    });
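  12. OpenAI Realtime API (Beta): Code Example (3/3)
    // Create the SDP offer and set it as the local description
    // (this step is not on the slide, but `offer` is used below)
    const offer = await pc.createOffer();
    await pc.setLocalDescription(offer);

    const baseUrl = "https://api.openai.com/v1/realtime";
    const model = "gpt-4o-realtime-preview-2024-12-17";

    // Send the offer to the Realtime API; EPHEMERAL_KEY is defined elsewhere
    const sdpResponse = await fetch(`${baseUrl}?model=${model}`, {
      method: "POST",
      body: offer.sdp,
      headers: {
        Authorization: `Bearer ${EPHEMERAL_KEY}`,
        "Content-Type": "application/sdp"
      },
    });

    // Apply the SDP answer returned by the server
    const answer = { type: "answer", sdp: await sdpResponse.text() };
    await pc.setRemoteDescription(answer);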
  13. OpenAI Realtime API (Beta): Session
    https://platform.openai.com/docs/guides/realtime-conversations#realtime-speech-to-speech-sessions (13.05.2025)
  14. OpenAI Realtime API (Beta): Session events
    – session.created (server → client): Session initialized with default values.
    – session.update (client → server): Update session voice, modalities, tools, turn detection.
    – session.updated (server → client): Session updated.
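The session events above can be sketched as a small helper that builds a `session.update` event and sends it as JSON over the data channel. This is a minimal sketch, not the deck's own code: the helper names (`buildSessionUpdate`, `sendEvent`) are hypothetical, the concrete values (voice, silence duration) are illustrative assumptions, and the field names should be checked against the linked beta API reference.

```javascript
// Build a session.update client event.
// Assumption: field names follow the beta API reference; values are examples only.
function buildSessionUpdate({ voice = "alloy", instructions = "", silenceMs = 500 } = {}) {
  return {
    type: "session.update",
    session: {
      voice,
      instructions,
      modalities: ["audio", "text"],
      // Server-side VAD: the model waits for this much silence before responding
      turn_detection: { type: "server_vad", silence_duration_ms: silenceMs },
    },
  };
}

// Client events are sent as JSON strings over the data channel.
function sendEvent(channel, event) {
  channel.send(JSON.stringify(event));
}

// Stand-in for the RTCDataChannel so the sketch also runs outside a browser;
// in the browser, pass the real `dc` once its "open" event has fired.
const sentMessages = [];
const fakeChannel = { send: (msg) => sentMessages.push(msg) };
sendEvent(fakeChannel, buildSessionUpdate({ voice: "alloy", silenceMs: 700 }));
```

If the update is accepted, the server should answer with `session.updated`.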
  15. OpenAI Realtime API (Beta): Session events
    https://platform.openai.com/docs/api-reference/realtime-client-events/session/update (13.05.2025)
  16. OpenAI Realtime API (Beta): Input audio buffer events (selection)
    – input_audio_buffer.speech_started (server → client): Server has detected speech.
    – input_audio_buffer.speech_stopped (server → client): Server has detected end of speech.
    – input_audio_buffer.committed (server → client): Session has committed input buffer and will create conversation item.
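One way to use the input audio buffer events above is to reduce them to a small UI state, for example to show a "listening" indicator. This is a sketch under assumptions: the state shape (`userSpeaking`, `committedTurns`) and both function names are hypothetical, and it assumes server events arrive as JSON strings in the `data` property of the data channel message.

```javascript
// Parse a data channel message into a server event object.
// Assumption: the event is a JSON string in messageEvent.data.
function parseServerEvent(messageEvent) {
  return JSON.parse(messageEvent.data);
}

// Reduce input_audio_buffer events to a simple, hypothetical UI state.
function reduceSpeechState(state, event) {
  switch (event.type) {
    case "input_audio_buffer.speech_started":
      return { ...state, userSpeaking: true };
    case "input_audio_buffer.speech_stopped":
      return { ...state, userSpeaking: false };
    case "input_audio_buffer.committed":
      return { ...state, committedTurns: state.committedTurns + 1 };
    default:
      return state; // ignore all other events
  }
}

// Simulated messages; in the browser these come from dc's "message" listener.
const simulatedMessages = [
  "input_audio_buffer.speech_started",
  "input_audio_buffer.speech_stopped",
  "input_audio_buffer.committed",
].map((type) => ({ data: JSON.stringify({ type }) }));

let speechState = { userSpeaking: false, committedTurns: 0 };
for (const message of simulatedMessages) {
  speechState = reduceSpeechState(speechState, parseServerEvent(message));
}
```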
  17. OpenAI Realtime API (Beta): Conversation events
    – conversation.item.create (client → server): Create a conversation item programmatically (e.g., from text input).
    – conversation.item.created (server → client): Input audio buffer has been committed, client has sent a conversation item, or the server is generating a response.
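Creating a conversation item from text input, as mentioned above, might look like the following sketch. The function names are hypothetical, and the item shape (a `message` with `input_text` content) is an assumption based on the beta API reference. The sketch also assumes that a text item does not trigger a response on its own, so it follows up with `response.create`.

```javascript
// Build a conversation.item.create event for a user text message.
// Assumption: item shape follows the beta API reference.
function buildUserTextItem(text) {
  return {
    type: "conversation.item.create",
    item: {
      type: "message",
      role: "user",
      content: [{ type: "input_text", text }],
    },
  };
}

// Ask the model to respond to the conversation so far.
function buildResponseCreate() {
  return { type: "response.create" };
}

// Serialize both events in the order they should be sent over the data channel.
function eventsForTextInput(text) {
  return [buildUserTextItem(text), buildResponseCreate()].map((event) =>
    JSON.stringify(event)
  );
}

const textInputEvents = eventsForTextInput("Hallo, KI!?");
// In the browser: textInputEvents.forEach((msg) => dc.send(msg));
```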
  18. OpenAI Realtime API (Beta): Response events (selection)
    – response.created (server → client): Response created, generation in progress.
    – response.audio.done (server → client): Response audio generated. Also called when a response is interrupted.
    – response.done (server → client): Response is done streaming.
  19. Tool/Function Calling: Let’s change the world
    Tools/Function calling can be used to…
    – extend the model’s knowledge by accessing custom data (customer data, articles, orders, wikis, postcode API, …)
    – extend the model’s capabilities by executing real-world actions (navigate, send an SMS, update order status in a database, fill in a form, perform a web search, …)
  20. Tool/Function Calling: Foundation of Agentic AI
    ReAct & Tools & MCP
    https://mcp.so/
  21. OpenAI Realtime API (Beta): Tool/function calling
    – OpenAI Realtime API supports adding tools at response level (response.create) or session level (session.update)
    – When processing input, the model determines whether it should call one of the available functions
    – The function must be executed by the client
    – Once the function has been executed, the client can create a new conversation item with the result of the function call (“return value”)
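Registering a tool at session level, as described above, could be sketched as follows. The `navigate` tool, its `route` parameter, and the helper name are hypothetical examples. The tool definition format (a JSON Schema under `parameters`) and the `tool_choice` field are assumptions based on the beta API reference.

```javascript
// Hypothetical tool: lets the model navigate the app to a given route.
const navigateTool = {
  type: "function",
  name: "navigate",
  description: "Navigate the application to the given route.",
  parameters: {
    type: "object",
    properties: {
      route: { type: "string", description: "Target route, e.g. /orders" },
    },
    required: ["route"],
  },
};

// Build a session.update event that makes the tools available to the model.
function buildToolSessionUpdate(tools) {
  return {
    type: "session.update",
    session: { tools, tool_choice: "auto" },
  };
}

const toolSessionUpdate = buildToolSessionUpdate([navigateTool]);
// In the browser: dc.send(JSON.stringify(toolSessionUpdate));
```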
  22. OpenAI Realtime API (Beta): Tool/Function calling events (selection)
    – session.update (client → server): Set available functions.
    – response.done (server → client): Contains the function call.
    – conversation.item.create (client → server): Provide the result of a function call.
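The event sequence above can be sketched as a client-side handler that picks function calls out of `response.done`, runs them locally, and builds the events that return the result. The handler table and the `navigate` tool are hypothetical, and handlers are kept synchronous for brevity (real ones will often be async). The item shapes (`function_call` with `name`, `call_id`, and a JSON string in `arguments`; `function_call_output` with `call_id` and `output`) are assumptions based on the beta API reference.

```javascript
// Hypothetical local tool implementations, keyed by function name.
const toolHandlers = {
  navigate: ({ route }) => ({ ok: true, route }),
};

// Turn a response.done event into the client events that answer its function calls.
function handleResponseDone(event, handlers) {
  const outgoing = [];
  const outputItems = (event.response && event.response.output) || [];
  for (const item of outputItems) {
    if (item.type !== "function_call") continue;
    const handler = handlers[item.name];
    // Arguments arrive as a JSON string and must be parsed before use
    const result = handler
      ? handler(JSON.parse(item.arguments))
      : { error: `Unknown function: ${item.name}` };
    outgoing.push({
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: item.call_id,
        output: JSON.stringify(result),
      },
    });
  }
  // Ask the model to continue once all results have been provided
  if (outgoing.length > 0) outgoing.push({ type: "response.create" });
  return outgoing;
}

// Simulated server event; in the browser it comes from the data channel.
const sampleResponseDone = {
  type: "response.done",
  response: {
    output: [
      { type: "function_call", name: "navigate", call_id: "call_1", arguments: '{"route":"/orders"}' },
    ],
  },
};
const toolResultEvents = handleResponseDone(sampleResponseDone, toolHandlers);
// In the browser: toolResultEvents.forEach((e) => dc.send(JSON.stringify(e)));
```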
  23. OpenAI Realtime API (Beta): Pricing
    https://community.openai.com/t/estimate-the-cost-for-1-min-usage-of-real-time-api/1019290/6 (13.05.2025)
  24. Summary
    – Realtime models unlock new, exciting opportunities for natural language interfaces beyond chat boxes
    – Bidirectional, multilingual, minimum latency
    – All available models (OpenAI Realtime/Gemini Live) in beta or preview
    – Quality is good, but not perfect
    – Pricing seems quite high
    – Fun!
    – No science fiction, try it today!