target domains
• A strategy based on compact models without external language models
• Domain adaptation without target audio data

Efficient domain adaptation
[Figure: Base Model (paired speech-text data) and Adaptation Model (unpaired text data); diagram labels: Speech, Text, Text]

Boosts phrases with dynamic user dictionaries
Speech Recognition
"Would you like to start the navigation via this route?"
Service-Specific Dictionary: "Yes", "No", "Prioritize expressways", "Prioritize general roads"
× Yeast → ◦ Yes
× Know → ◦ No
× Prioritize general loads → ◦ Prioritize general roads

Resolves homonyms (words written the same but read differently)
• e.g.
・日本橋(ニホンバシ) is a location in Tokyo
・日本橋(ニッポンバシ) is a location in Osaka
• Joint prediction of both surface and reading
[Figure: End-to-End ASR takes Speech and outputs Surface/Read pairs: 日本橋 / ニホンバシ, まで / マデ, の / ノ, …]

※ About Feature 2
Feature 1: High accuracy for web search and LY Corporation domain
Feature 2: Resolves homonyms and customizes easily
Feature 3: Provides Web API and on-device modules
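The slide does not say how the user dictionary is applied inside the recognizer. As a rough illustration only, the sketch below assumes the recognizer returns an n-best list of (text, log-score) pairs and adds a fixed bonus to any hypothesis that matches a dictionary phrase. The function name `boost_with_dictionary`, the scores, and the bonus value are all hypothetical.

```python
# Hypothetical sketch: rescoring ASR n-best hypotheses with a user dictionary.
# The scores, the bonus value, and the function name are illustrative assumptions,
# not the method used in the system described on the slide.

def boost_with_dictionary(nbest, dictionary, bonus=2.0):
    """Return the best hypothesis after adding a bonus to dictionary phrases.

    nbest: list of (text, log_score) pairs from a recognizer.
    dictionary: phrases the service expects at this point in the dialogue.
    """
    normalized = {phrase.lower() for phrase in dictionary}
    rescored = [
        (text, score + (bonus if text.lower() in normalized else 0.0))
        for text, score in nbest
    ]
    # Pick the hypothesis with the highest rescored value.
    return max(rescored, key=lambda pair: pair[1])[0]


service_dictionary = ["Yes", "No", "Prioritize expressways", "Prioritize general roads"]

# Made-up n-best lists in which the top-scoring guess is a misrecognition.
print(boost_with_dictionary([("Yeast", -1.0), ("Yes", -1.5)], service_dictionary))
print(boost_with_dictionary(
    [("Prioritize general loads", -3.0), ("Prioritize general roads", -3.4)],
    service_dictionary,
))
```

Because the bonus is finite, a hypothesis outside the dictionary can still win when its score is much higher, which keeps the dictionary from forcing a match on unrelated speech.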
is implemented in most Yahoo JAPAN services
• i.e. 17 services, including Maps, Transit, and Shopping
Yahoo JAPAN App (iOS/Android)
Examples of Application (1/3)
• On-Device Neural Text-To-Speech Engine (called “Achoris Lite”)
• Generates a 10-second audio waveform in 0.2 seconds!
* Real Time Factor (RTF) is 0.02 on iPhone 15
• To minimize app size, we've optimized both the inference libraries and the model size
Yahoo JAPAN Car Navigation App
Examples of Application (2/3)
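The RTF figure follows directly from the two numbers on the slide: processing time divided by the duration of the generated audio. A minimal sketch of that arithmetic, where the function name `real_time_factor` is our own:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the synthesized audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Numbers from the slide: a 10-second waveform generated in 0.2 seconds.
rtf = real_time_factor(0.2, 10.0)
print(f"RTF = {rtf:.2f}")  # RTF = 0.02
```

An RTF below 1 means synthesis runs faster than playback. At 0.02, synthesis runs 50 times faster than real time.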
speech generation using audio tokens or a streaming TTS module
[Figure: architecture diagram with labels Audio Adapter, Speech Encoder, Prompt, Speech]
• Modality alignment between text and audio
• Single Stream: User's speech only
• Multi Stream: User's speech + LLM-generated speech
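The slide names an Audio Adapter but does not describe its architecture. One common design, assumed here purely for illustration, stacks a few consecutive speech-encoder frames to shorten the sequence and then applies a linear projection into the LLM's embedding dimension. The dimensions, the random weights, and all function names below are hypothetical.

```python
import random

# Hypothetical sketch of an audio adapter: frame stacking + linear projection.
# This is one common design, not necessarily the one used in the deck.

def stack_frames(frames, k):
    """Concatenate every k consecutive frames, dropping any leftover tail."""
    usable = len(frames) - len(frames) % k
    return [sum(frames[i:i + k], []) for i in range(0, usable, k)]


def linear(vectors, weight, bias):
    """Apply y = W x + b to each vector; weight has shape (out_dim, in_dim)."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]
        for vec in vectors
    ]


def audio_adapter(encoder_frames, weight, bias, k):
    """Map speech-encoder frames to vectors in the LLM embedding space."""
    return linear(stack_frames(encoder_frames, k), weight, bias)


random.seed(0)
ENC_DIM, LLM_DIM, K = 4, 6, 2  # toy sizes; real models are far larger
frames = [[random.uniform(-1, 1) for _ in range(ENC_DIM)] for _ in range(7)]
weight = [[random.uniform(-0.1, 0.1) for _ in range(ENC_DIM * K)] for _ in range(LLM_DIM)]
bias = [0.0] * LLM_DIM

audio_embeddings = audio_adapter(frames, weight, bias, K)
print(len(audio_embeddings), len(audio_embeddings[0]))  # 3 6
```

In a design like this, the projected vectors are placed alongside the prompt's text embeddings, so the LLM receives both modalities as one sequence.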
char_f1) / ALT (Translation from jp to en, BertScore)
Ground truth text (text-to-text)    0.853    0.941
Transcribed text (text-to-text)     0.750    0.921
Speech (speech-to-text)             0.844    0.910

• Comparable performance on translation task (0.910 for speech input vs. 0.921 for transcribed text)
• Better performance on QA task (0.844 for speech input vs. 0.750 for transcribed text)

LLM: gemma2-2b-it
ASR model: whisper small
Speech Encoder: whisper small
Training Toolkit: SLAM-LLM
Evaluation Toolkit: llm-jp-eval
SLAM-LLM: https://github.com/X-LANCE/SLAM-LLM
llm-jp-eval: https://github.com/llm-jp/llm-jp-eval
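To give a feel for what a character-level F1 score measures, the sketch below computes a generic character-overlap F1 between a predicted answer and a reference answer. This is an assumption for illustration only: llm-jp-eval's own `char_f1` may be computed differently, and `char_overlap_f1` is a hypothetical name.

```python
from collections import Counter

# Generic character-overlap F1, shown only to illustrate the idea of a
# character-level score. It is NOT claimed to match llm-jp-eval's char_f1.

def char_overlap_f1(prediction: str, reference: str) -> float:
    if not prediction or not reference:
        # Two empty strings count as a perfect match; otherwise no overlap.
        return float(prediction == reference)
    # Multiset intersection counts characters shared by both strings.
    overlap = sum((Counter(prediction) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


print(char_overlap_f1("東京", "東京"))
print(char_overlap_f1("東京都", "東京"))
```

A character-level score suits Japanese QA because answers are short and are not delimited by spaces, so a word-level F1 would need a tokenizer first.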
[Figure: chart with axis labels Generated Characters per Second and Number of Tokens; series: vllm, slam-llm (transformers)]
• 3.2x faster (Generates 'How can I help you today?' in 0.45 seconds)
vLLM: Kwon, Woosuk, et al. "Efficient memory management for large language model serving with PagedAttention." Proceedings of the 29th Symposium on Operating Systems Principles. 2023.
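The throughput unit on the chart, generated characters per second, can be checked against the example sentence quoted on the slide. A minimal sketch, where the helper name `chars_per_second` is our own:

```python
def chars_per_second(text: str, seconds: float) -> float:
    """Throughput in generated characters per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return len(text) / seconds


# Example from the slide: this reply is generated in 0.45 seconds.
reply = "How can I help you today?"
rate = chars_per_second(reply, 0.45)
print(f"{len(reply)} characters in 0.45 s = {rate:.1f} characters/s")
# 25 characters in 0.45 s = 55.6 characters/s
```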
with LLM)
• Multilingual Speech-To-Text/Text-To-Speech
Voice Control in a Car
Human-like and Natural Conversational Search
Spoken Dialogue via Call
[Figure: AI Agent OR Search? Weather? Podcast? (AI Agent ×3)]