

LINEヤフーの音声AIがもたらす未来:ASR/TTSと対話技術の新たな可能性 / LY Corporation's Speech AI Vision: Towards Realtime Spoken Dialogue through Advanced ASR and TTS

This deck introduces applications of LINEヤフー (LY Corporation)'s speech recognition and speech synthesis technologies, and the company's work on LLM-based real-time spoken dialogue, an area that has attracted much attention in recent years.


Transcript

  1. LY Corporation's Speech AI Vision: Towards Realtime Spoken Dialogue through Advanced ASR and TTS
     Speech and Acoustic AI Dept., Data Science Group
     Jumpei Miyake, Taiki Kinoshita
  2. Agenda
     - LY Corporation's Speech AI
     - LY Corporation's ASR/TTS
     - Real-time Speech-to-Speech
     - Future Works
  3. LY Corporation's Speech AI
     - Core technologies: Speech Recognition, Speech Generation, Video and Audio Content Analysis
     - Applications: video/audio contents, call centers, meetings, voice user interfaces, and video/audio content and call analysis
     (Photo credit: Aflo)
  4. YJVOICE: Streaming Speech Recognition
     - Feature 1: High accuracy for web search and the LY Corporation domain
     - Feature 2: Resolves homonyms and is easily customizable
     - Feature 3: Provides a Web API and on-device modules
     About Feature 2:
     - Efficient domain adaptation: a strategy based on compact models without external language models, enabling domain adaptation without target audio data (a base model trained on paired speech-text data plus an adaptation model trained on unpaired text data)
     - Phrase boosting with dynamic user dictionaries: e.g., given the prompt "Would you like to start the navigation via this route?", a service-specific dictionary of "Yes" / "No" / "Prioritize expressways" / "Prioritize general roads" corrects misrecognitions such as "Yeast" → "Yes", "Know" → "No", and "Prioritize general loads" → "Prioritize general roads"
     - Homonym resolution: the end-to-end ASR jointly predicts both the surface form and its reading, e.g., 日本橋 read ニホンバシ (Nihonbashi) is a location in Tokyo, while 日本橋 read ニッポンバシ (Nipponbashi) is a location in Osaka
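The dynamic-dictionary phrase boosting above can be illustrated as an n-best rescoring step. This is a minimal sketch, not YJVOICE's actual mechanism (which is not public and would typically bias hypotheses inside beam search rather than post hoc); the function name and bonus value are hypothetical.

```python
# Hypothetical sketch: boost ASR n-best hypotheses that contain phrases
# from a service-specific user dictionary, then pick the best hypothesis.

def boost_with_dictionary(hypotheses, dictionary, bonus=2.0):
    """hypotheses: list of (text, log_score); dictionary: list of phrases.
    Adds `bonus` to a hypothesis's score for each dictionary phrase it
    contains (naive substring match, for illustration only)."""
    rescored = []
    for text, score in hypotheses:
        hits = sum(1 for phrase in dictionary if phrase.lower() in text.lower())
        rescored.append((text, score + bonus * hits))
    return max(rescored, key=lambda x: x[1])[0]

nbest = [("prioritize general loads", -1.0),   # acoustically likely, but wrong
         ("prioritize general roads", -1.3)]   # in the service dictionary
dictionary = ["Yes", "No", "Prioritize expressways", "Prioritize general roads"]
print(boost_with_dictionary(nbest, dictionary))  # prioritize general roads
```

The dictionary bonus outweighs the small acoustic-score gap, so the in-dictionary phrase wins, mirroring the "loads" → "roads" correction on the slide.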
  5. Achoris: Expressive Text-To-Speech
     - Feature 1: Controls emotion intensity with 7 expression styles
     - Feature 2: 17+ preset speaker options with human-like quality
     - Feature 3: Provides a Web API, on-device modules, and a web-based editing tool
     - Achoris: expressive text-to-speech with control over speaker, emotion, and intensity
     - Achoris Editor: a text-to-speech editing tool
  6. Examples of Application (1/3): Speech Recognition in Services
     - Voice search in the Yahoo JAPAN App (iOS/Android)
     - Voice search is implemented in most Yahoo JAPAN services, e.g., 17 services including Maps, Transit, and Shopping
  7. Examples of Application (2/3): Speech Synthesis in Services
     - Navigation voice in the Yahoo JAPAN Car Navigation App
     - On-device neural text-to-speech engine ("Achoris Lite")
     - Generates a 10-second audio waveform in 0.2 seconds: the real-time factor (RTF) is 0.02 on an iPhone 15
     - To minimize app size, various optimizations were applied to both the inference libraries and the model size
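The real-time factor quoted above is simply synthesis time divided by the duration of the generated audio; the slide's two figures are consistent:

```python
# Real-time factor (RTF): time taken to synthesize / duration of the audio.
# RTF < 1 means faster than real time.

def real_time_factor(synthesis_seconds, audio_seconds):
    return synthesis_seconds / audio_seconds

# The slide's numbers: a 10-second waveform generated in 0.2 seconds.
rtf = real_time_factor(0.2, 10.0)
print(rtf)  # 0.02, i.e., 50x faster than real time
```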
  8. Examples of Application (3/3): A lab app for real-time spoken dialogue based on an LLM (under development)
     https://www.lycorp.co.jp/ja/technology-design/labs/
  9. Real-time Speech-to-Speech Trend
     - OpenAI, ChatGPT Advanced Voice Mode (*1)
     - Google, Gemini Live (*2)
     - Kyutai, Moshi (*3)
     - Nagoya Univ., J-Moshi (*4)
     (*1) https://openai.com/ja-JP/chatgpt/overview/
     (*2) https://gemini.google/overview/gemini-live/
     (*3) https://moshi.chat/
     (*4) https://nu-dialogue.github.io/j-moshi/
 10. Speech-to-Speech Architecture
     - Speech Encoder: encodes the input speech; a single-stream setup takes the user's speech only, while a multi-stream setup takes the user's speech plus the LLM-generated speech
     - Audio Adapter: aligns the audio modality with the LLM's text modality
     - Large Language Model: conditioned on a prompt
     - Text-Guided Speech Generation: low-latency speech generation using audio tokens or a streaming TTS module
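The audio-adapter stage above can be sketched as follows: the speech encoder emits frame embeddings, and the adapter temporally downsamples and linearly projects them into the LLM's embedding space so they can be interleaved with text-prompt embeddings. This is an illustrative NumPy sketch, not the deck's implementation; all dimensions are assumptions (e.g., 50 frames/s at dim 768, roughly Whisper-small-like, projected to an assumed LLM dim of 2048).

```python
import numpy as np

def audio_adapter(frames, proj, stride=4):
    """Stack every `stride` consecutive encoder frames and apply a linear
    projection, yielding fewer, LLM-dimensional 'audio tokens'."""
    t, d = frames.shape
    t = t - t % stride                                     # drop ragged tail
    stacked = frames[:t].reshape(t // stride, stride * d)  # temporal downsample
    return stacked @ proj                                  # (t // stride, d_llm)

rng = np.random.default_rng(0)
frames = rng.standard_normal((100, 768))            # 2 s of 50 Hz encoder output
proj = rng.standard_normal((4 * 768, 2048)) * 0.01  # learned in practice
audio_embeds = audio_adapter(frames, proj)
print(audio_embeds.shape)  # (25, 2048): 12.5 audio tokens per second
```

Downsampling matters here: feeding raw 50 Hz frames to the LLM would inflate sequence length and latency, while a 4x stride keeps the audio token rate close to a text-like rate.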
 11. Pros of Integrating a Speech Encoder with LLMs
     - Leverages the LLM's advanced foundation capabilities
     - Prompt-driven flexibility and customizability
     - Bypasses the impact of ASR errors
 12. Evaluation of Task Performance (Speech LLM)

     Input                            | JSQuAD (question answering, char_f1) | ALT (ja→en translation, BERTScore)
     Ground-truth text (text-to-text) | 0.853                                | 0.941
     Transcribed text (text-to-text)  | 0.750                                | 0.921
     Speech (speech-to-text)          | 0.844                                | 0.910

     - Relative to the transcribed-text pipeline: comparable performance on the translation task, better performance on the QA task
     - LLM: gemma2-2b-it; ASR model: Whisper small; speech encoder: Whisper small
     - Training toolkit: SLAM-LLM (https://github.com/X-LANCE/SLAM-LLM)
     - Evaluation toolkit: llm-jp-eval (https://github.com/llm-jp/llm-jp-eval)
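The char_f1 scores above are character-level F1. One common definition, sketched below, scores the overlap of character multisets between prediction and reference; llm-jp-eval's exact implementation may differ, so treat this as an illustration of the metric family rather than the toolkit's code.

```python
# Character-level F1 under a multiset-overlap definition (assumption:
# this is one common formulation, not necessarily llm-jp-eval's exact one).
from collections import Counter

def char_f1(pred, ref):
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(char_f1("東京都", "東京"))  # 0.8: 2 shared chars, precision 2/3, recall 1
```

Character-level scoring is a natural fit for Japanese QA, where word segmentation is ambiguous and partially correct spans still deserve credit.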
 13. Evaluation of Inference Speed
     - Generated characters per second: 56 with vLLM vs. 17 with SLAM-LLM (transformers), i.e., vLLM is about 3.2x faster
     - At this speed, vLLM generates "How can I help you today?" in 0.45 seconds
     - vLLM: Kwon, Woosuk, et al. "Efficient memory management for large language model serving with PagedAttention." Proceedings of the 29th Symposium on Operating Systems Principles, 2023.
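The slide's latency claim follows directly from the throughput numbers, as a quick back-of-the-envelope check shows (the rounded bar values give 3.3x; the slide's 3.2x presumably comes from unrounded measurements):

```python
# Sanity-check the slide's throughput numbers.
vllm_cps = 56  # generated characters per second with vLLM
hf_cps = 17    # generated characters per second with SLAM-LLM (transformers)

speedup = vllm_cps / hf_cps
print(round(speedup, 1))  # 3.3 from these rounded values (slide: ~3.2x)

reply = "How can I help you today?"           # 25 characters
print(round(len(reply) / vllm_cps, 2))        # 0.45 seconds, as on the slide
```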
 14. Future Works
     We are developing:
     - Realtime Speech-to-Speech (integration with an LLM)
     - Multilingual Speech-To-Text / Text-To-Speech
     Target scenarios: voice control in a car, human-like and natural conversational search, and spoken dialogue with an AI agent via call (e.g., search, weather, podcasts)
  15. EOP