Upgrade to Pro — share decks privately, control downloads, hide ads and more …

11_22_CEE_Singapore_2024_Presentation.pdf

yoheimuta
November 18, 2024

 11_22_CEE_Singapore_2024_Presentation.pdf

# Title

Conversational AI for Next-Gen Social & Entertainment Solutions
Yohei Yoshimuta, Head of Engineering, Parallel

https://webinars.agora.io/CEE2024/register

# Demo Video Link

Parallel with Realtime API Demo: https://www.youtube.com/watch?v=pWK6Dv5QKPg

# (Japanese) 発表の概要

パラレルアプリでは、複数ユーザーとAIがリアルタイムで会話できるよう、Agoraの音声ミキシング技術とOpenAIのリアルタイムAPIを組み合わせたシステムを構築しました。Agoraの音量イベントを活用して各話者を識別し、個々のユーザーの趣味や嗜好に合わせた応答を生成することで、グループ通話でもパーソナライズされた会話体験を提供しています。また、オセロゲームの進行状況をリアルタイムで把握し、AIによる1vs1のゲーム実況にも対応しました。この際、盤面全体ではなく各手ごとの情報を伝達することで、より精度の高い実況を実現しています。さらに、WhisperエンジンによるWakeワード検出を実装し、ユーザーが音声コマンドでAIを起動・停止しつつ、セッションを継続できる設計を採用しています。実運用にはWakeワード機能は専用のモデルに置き換える必要がありますが、Realtime APIの応答としてWhisperエンジンで書き起こしたテキストを直接受け取れるため、今回のようにプロトタイプを高速に作成する際には便利です。

また、管理画面からプロンプトや命令を柔軟に調整できるようにし、実現したいシナリオに沿った最適な応答を設定できるよう工夫しました。これにより、システムの調整を素早く行い、反復的なイテレーションを高速に回すことが可能になりました。最後に、AIの声の調子が単調になりがちな点については、トーン調整が可能になるような改善を期待しており、今後のOpenAI技術のさらなる進化に期待しています。

yoheimuta

November 18, 2024
Tweet

More Decks by yoheimuta

Other Decks in Technology

Transcript

  1. Copyright © Parallel Inc. All Rights Reserved.
 • X /

    GitHub: @yoheimuta
 
 • Recent Focus: SRE, MySQL, Vitess, Flink 
 
 • Career
 ◦ Mobile Game : Worked on SNS game launches at GREE 
 ◦ AdTech: Built the LINE Ads Platform at FreakOut and LINE 
 Yohei Yoshimuta 
 Head of Engineering, Parallel
 Speaker 

  2. Copyright © Parallel Inc. All Rights Reserved.
 Our goal 


    Enjoying It with Friends 
 Enjoying Entertainment Alone 
 Parallel brings friends together for shared experiences, aiming to be a central hub for all types of entertainment in a world where affordable, segmented content often drives solo consumption. 

  3. Copyright © Parallel Inc. All Rights Reserved.
 About Parallel: Your

    Hangout Place 
 Parallel is a hangout app where you can hop on, see close friends already online, join them instantly, and enjoy content together, both in and outside the app. 
 

  4. Copyright © Parallel Inc. All Rights Reserved.
 Achievements 
 With

    MAU in Japan now reaching levels similar to Twitch and Roblox, 70% Gen Z, 3-hour daily call times, over $20 million raised, and all with a team of just 16. 
 # of Teams 6
  5. Copyright © Parallel Inc. All Rights Reserved.
 How Parallel Embraces

    AI 
 Parallel’s key feature is being a space where young people spend hours with friends, now enhanced with AI that joins voice chats to act as a gaming partner, study buddy and more! 
 
 
 Users ..
 
 - hang out with friends in groups 
 
 - spend around 180 mins a day on voice calls
 
 - do everything from gaming to studying and watching videos, both on and off the platform 
 
 Parallel’s standout features 
 Our AI ..
 
 - acts as a conversation partner, gaming buddy, study helper, etc 
 
 - keeps group conversations on track and offers ideas
 
 - can take on roles like a poker dealer, energize the group, spin as a DJ, or even step in as a study tutor 
 Parallel with AI 

  6. Copyright © Parallel Inc. All Rights Reserved.
 Our Initial Approach

    with Realtime API 
 With the release of the Realtime API, our initial approach was to create a demo specifically centered around gaming as an example. 
 1. We use a Wake Word to call AI whenever users need. 
 
 2. Our AI acts as a chat buddy or game partner until friends arrive. 
 
 3. Our AI mediates conversations when in a group. 
 
 4. Our AI becomes the game commentator when people playing together. 

  7. Copyright © Parallel Inc. All Rights Reserved.
 To enable AI-powered

    conversations, we have added:
 - Agora Demo Server : Manages real-time communication with the OpenAI Realtime API.
 - OpenAI Realtime API 
 System Overview 
 The core system is a group call application consisting of the following components: 
 - Parallel App
 - Parallel API Server 
 - Agora Voice Server (SaaS) 
 Existing System 
 AI-Powered Conversations 

  8. Copyright © Parallel Inc. All Rights Reserved.
 Key Challenges in

    Implementing AI-Driven Group Conversations 
 While the API supports audio and text processing, achieving the following complex features wasn’t straightforward and required thoughtful customization and innovation to make it all work seamlessly. 
 1. Real-Time Group Conversation 
 
 2. Speaker Differentiation 
 
 3. Wake Word Detection 
 
 4. Real-Time Game State Synchronization 

  9. Copyright © Parallel Inc. All Rights Reserved.
 • Solution 


    ◦ Merge individual streams into a single mixed stream on the client side.
 
 • Implementation 
 ◦ Used Agora’s onPlaybackAudioFrame to manage mixed audio streams.
 AI facilitating real-time conversations with multiple users is essential for our application. A key challenge is that the Realtime API is not designed to support multiple audio streams from different users simultaneously. 
 Real-Time Group Conversation with AI 

  10. Copyright © Parallel Inc. All Rights Reserved.
 Speaker Identification 


    • Solution 
 ◦ Use speaker volume data to distinguish individual speakers
 
 • Implementation 
 ◦ Leveraged Agora’s onAudioVolumeIndication to monitor volume levels for each participant 
 Identifying speakers and tailoring responses for each user is essential in group conversations, but the AI model can't distinguish speakers in a single mixed audio stream. 
 

  11. Copyright © Parallel Inc. All Rights Reserved.
 • Solution 


    ◦ Integrate wake word detection to activate the AI as needed.
 
 • Implementation 
 ◦ Checked audio transcriptions for wake or sleep words
 ◦ If detected, VAD (Voice Activity Detection) is turned on or off
 Wake Word Detection 
 To enable hands-free interaction, detecting wake word in real-time is crucial. However, the system needs to maintain session continuity while accurately identifying wake word. 
 
 📝VAD identifies speech in the audio stream. When off (default is on), the AI only responds when requested by the system client 

  12. Copyright © Parallel Inc. All Rights Reserved.
 Wake Word Detection

    System Flow 
 👍The Whisper Engine enables quick prototyping with real-time transcriptions. 
 🤖But a dedicated model will be needed in production for greater robustness and precision. 

  13. Copyright © Parallel Inc. All Rights Reserved.
 Real-Time Game Commentary

    
 • Solution 
 ◦ Real-time game status updates are sent from the Parallel server to the AI as user text messages via the API.
 
 • Implementation 
 ◦ Only the moves (where each player places a piece) are sent to the AI, rather than the entire board state.
 In 1v1 games, conversations often dwindle, creating awkward silences. AI commentary helps maintain engagement, but relying on audio input alone is insufficient to fully convey the dynamic game state.
 

  14. Copyright © Parallel Inc. All Rights Reserved.
 ⚠We learned that

    conveying the full board state in text led to misinterpretations. 
 ✅So we reduced the information to key moves and formatted it for clarity, using prompt engineering techniques. 
 Real-Time Game Commentary System Flow 

  15. Copyright © Parallel Inc. All Rights Reserved.
 • Rapid Prototyping

    ⚡
 ◦ 1 engineer, 2 weeks prototype
 ◦ Previously, setting up AI-driven calls required complex infrastructure, and game commentary needed intricate rule-based systems. 
 
 • Improved Tone Control 🛠
 ◦ AI responses sometimes lack tonal variety, especially in commentary, missing subtle cues like excitement or surprise. 
 ◦ The recently added voices bring impressive tonal variety, offering natural and expressive delivery 🤯.
 ◦ Yet we’re still learning to tailor the prompt effectively as they were only just released.
 Learnings and Future Enhancements 

  16. Copyright © Parallel Inc. All Rights Reserved.
 Key Takeaways 


    🧩Challenge 💡Solution ⚙Implementation 1. Real-Time Group Conversation Merge individual audio streams into a single mixed stream Used Agora’s onPlaybackAudioFrame for real-time audio management 2. Speaker Differentiation Identify active speaker based on volume data Leveraged onAudioVolumeIndication to detect and distinguish speakers 3. Wake Word Detection Enable AI activation via wake words Used Whisper Engine for transcription and VAD for response control 4. Real-Time Game State Synchronization Update AI with individual moves instead of full board Sent each move as a user message, refined through prompt engineering Scan the QR code to view the full presentation