[Reading-group slides] Moshi: a speech-text foundation model for real-time dialogue

Slides introducing the paper that proposes Moshi, a real-time speech dialogue model.


Hayato Tsukagoshi

July 15, 2025



Transcript

  1. Moshi: a speech-text foundation model for real-time dialogue
     Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, Neil Zeghidour
     https://arxiv.org/abs/2410.00037
     Nagoya Univ. D3, Hayato Tsukagoshi
  2. Overview (p. 2)
     • A paper proposing Moshi, a full-duplex real-time dialogue model
       • The model can produce speech while simultaneously listening to the user's audio
       • ⇔ half-duplex: while one side is speaking, the other side cannot speak
     • Research from Kyutai, a non-profit research lab based in Paris, France
     • Its defining feature is an architecture designed end-to-end for streaming
       • User audio, model audio, and model text are fed to the model simultaneously
     • Also develops and uses the neural audio codec Mimi
       • Tokenizes 24,000 Hz audio into a 12.5 Hz token sequence
  3. Moshi's architecture (p. 5)
     • An autoregressive-Transformer-based 7B model plus a speech tokenizer
     • The speech tokenizer Mimi converts audio into ID sequences so it can be handled discretely
       • The frame rate (tokens per second) is 12.5 (see the arithmetic sketch below)
     • Inputs: the user's audio, the model's audio, and text (the inner monologue)
       • The vectors corresponding to each are summed and fed to the Transformer
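A minimal sketch of the frame-rate arithmetic implied above, assuming only the numbers on this slide (24,000 Hz input, 12.5 tokens per second); the variable names are illustrative:

```python
# Frame-rate arithmetic for Mimi, assuming 24 kHz input and 12.5 tokens/s.
sample_rate_hz = 24_000          # input sampling rate
frame_rate_hz = 12.5             # tokens (frames) per second

samples_per_frame = sample_rate_hz / frame_rate_hz   # 1920 samples per token
ms_per_frame = 1000 / frame_rate_hz                  # 80 ms of audio per token

print(samples_per_frame, ms_per_frame)               # 1920.0 80.0
```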
  4. Mimi (p. 6)
     • One of the foundational components behind Moshi: 96.2M parameters, built from convolutions and a Transformer (hf)
       • 80 ms of audio is treated as one token; the input sampling rate is 24,000 Hz
     • A neural audio codec that converts speech waveforms into discrete audio tokens
       • Adopts the discrete bottleneck known from VQ-VAE
     • Two kinds of audio tokens are produced: acoustic tokens and semantic tokens
       • Semantic token: captures the semantic and phonetic content of the speech
         • Distilled from WavLM embedding representations
       • Acoustic token: captures fine-grained acoustic detail
     • A Residual Vector Quantizer (RVQ) quantizes the waveform in stages
  5. Mimi (p. 7): a repeat of the previous slide.
  6. RVQ: conceptual diagram, after n quantization steps (p. 19)
     [Figure: a codebook with entries id=0, id=1, id=2, id=3, …, id=2047 quantizes the target vector; the output is an ID sequence such as [1, 3, 2, 2047, …, 4]. A minimal code sketch follows below.]
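A minimal sketch of residual vector quantization as depicted in the diagram, using NumPy and randomly initialized codebooks (Mimi's real codebooks are learned, and this is not the Kyutai implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
num_codebooks, codebook_size, dim = 8, 2048, 512       # illustrative sizes
codebooks = rng.normal(size=(num_codebooks, codebook_size, dim))

def rvq_encode(x, codebooks):
    """Quantize x in stages: each codebook encodes the residual left over
    by the previous stages, producing one ID per codebook."""
    residual = x.copy()
    ids = []
    for cb in codebooks:
        dists = np.linalg.norm(cb - residual, axis=1)  # distance to each entry
        idx = int(np.argmin(dists))                    # nearest codebook entry
        ids.append(idx)
        residual = residual - cb[idx]                  # pass the residual on
    return ids

x = rng.normal(size=dim)            # one latent frame from the encoder
print(rvq_encode(x, codebooks))     # an ID sequence such as [1, 3, 2, 2047, ...]
```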
  7. Overview of Mimi's training: a heavily simplified view (p. 21)
     [Figure: a Mimi Encoder / Mimi Decoder pair trained with a reconstruction loss plus an adversarial loss; the encoder output is also pulled toward a frozen (❄) WavLM via cosine similarity]
     • The latents are pulled toward the non-causal model's vectors while audio quality is also raised (a loss sketch follows below)
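A minimal sketch of the training signal summarized above, with hypothetical module names (mimi_encoder, mimi_decoder, wavlm, discriminator); it only illustrates how the three terms combine and is not the actual Kyutai training code:

```python
import torch
import torch.nn.functional as F

def mimi_training_loss(wave, mimi_encoder, mimi_decoder, wavlm, discriminator):
    latents = mimi_encoder(wave)        # causal encoder output (RVQ details elided)
    recon = mimi_decoder(latents)       # decoded waveform

    # Reconstruction + adversarial terms push the decoded audio quality up.
    recon_loss = F.l1_loss(recon, wave)
    adv_loss = -discriminator(recon).mean()

    # Distillation: pull the semantic latent toward the frozen, non-causal
    # WavLM features via cosine similarity.
    with torch.no_grad():
        teacher = wavlm(wave)
    distill_loss = 1.0 - F.cosine_similarity(latents, teacher, dim=-1).mean()

    return recon_loss + adv_loss + distill_loss
```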
  8. Overview of Moshi's architecture (p. 28)
     • First, build an ordinary autoregressive language model
       • Public English corpora, 2.1T tokens, sequence length 4096, model size 7B
       • The resulting 7B LLM is named Helium
       • At this stage it is simply text-in, text-out
     • Next, starting from Helium, add audio to the inputs and outputs and train
       • Even so, training still predicts Mimi tokens, so it is essentially unchanged from ordinary language modeling (next-token prediction)
     • Consists of a Temporal Transformer (initialized from Helium) and a Depth Transformer
       • Together, the two are called the RQ-Transformer (a sketch follows after the next slide)
  9. Overview of Moshi's architecture (p. 29): a repeat of the previous slide.
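A minimal sketch of how the Temporal + Depth Transformer pair could produce one time step, with hypothetical modules and greedy decoding for brevity; only Moshi's own text and audio tokens are shown, and this is not the released code:

```python
import torch

def rq_transformer_step(temporal_tf, depth_tf, past_step_vectors):
    # The Temporal Transformer (initialized from Helium) runs once per time
    # step over the sequence of per-step summed input vectors.
    context = temporal_tf(past_step_vectors)[:, -1]     # (batch, dim)

    # The Depth Transformer then predicts this step's tokens one by one:
    # the text token first, then the 8 audio tokens (semantic + acoustic).
    tokens, prev = [], context
    for _ in range(1 + 8):
        logits = depth_tf(prev, context)                # condition on the step context
        tok = logits.argmax(dim=-1)                     # greedy, for simplicity
        tokens.append(tok)
        prev = depth_tf.embed(tok)                      # feed the token back in
    return tokens
```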
  10. Overview diagram of Moshi's input (p. 32)
      • At each time step…
        • the user's audio contributes 1+7 tokens
        • the model's audio contributes 1+7 tokens
        • the model's text contributes 1 token
        → multi-stream modeling
      • At each time step, everything is summed into a single vector and fed to the model
  11. Overview diagram of Moshi's input (p. 33), same slide as above, now with a pointer to the original implementation:
      https://github.com/kyutai-labs/moshi/blob/950e9771dc33d7aa48f80175a189c5c902016df2/moshi/moshi/models/lm.py#L381
      • Hard to believe, but the 17 embeddings really are summed into a single vector (a sketch follows below)
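A minimal sketch of the per-step input construction, under the assumption that the 17 token embeddings (1 text + 8 model audio + 8 user audio) are simply summed; the dimensions and vocabulary sizes are illustrative, and the linked lm.py is the authoritative version:

```python
import torch
import torch.nn as nn

dim, text_vocab, audio_vocab = 4096, 32_000, 2048       # illustrative sizes
text_emb = nn.Embedding(text_vocab, dim)
audio_embs = nn.ModuleList([nn.Embedding(audio_vocab, dim) for _ in range(16)])

def step_input(text_token, model_audio_tokens, user_audio_tokens):
    """text_token: (batch,); the audio arguments are lists of 8 (batch,) tensors."""
    x = text_emb(text_token)
    for emb, tok in zip(audio_embs, model_audio_tokens + user_audio_tokens):
        x = x + emb(tok)             # 17 embeddings summed into a single vector
    return x                         # (batch, dim), fed to the Temporal Transformer
```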
  12. Overview diagram of Moshi's input: the streaming case (p. 35)
      • Input arrives at the model at fixed time intervals
      • For streaming processing:
        • the model must finish its computation within that fixed interval and emit its output
        • that output is fed back in as input, together with the user's audio for the next time step (see the loop sketch below)
      • Note: the 2-token offset is just the shift used in this figure, not the actual value
      [Figure: model output audio, user input audio, and model output text streams, shifted by 2 tokens while the model is "listening"; same-colored cells mark corresponding inputs]
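A minimal sketch of the streaming loop described on this slide; every object and method name here (mic, speaker, model.step, …) is hypothetical and only illustrates the feedback structure:

```python
def streaming_loop(model, mic, speaker):
    prev_model_tokens = model.initial_tokens()
    while True:
        # One new user frame (1+7 tokens) arrives every 80 ms.
        user_tokens = mic.next_frame_tokens()
        # The model must finish this step within the 80 ms budget.
        model_tokens = model.step(prev_model_tokens, user_tokens)
        speaker.play(model_tokens)               # decode with Mimi and play back
        prev_model_tokens = model_tokens         # fed back in at the next step
```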
  13. Reuse for speech recognition (ASR) and speech synthesis (TTS) (p. 44)
      [Figure: ASR maps audio → text and TTS maps text → audio; in each case one stream waits for the other]
      • Moshi's multi-stream modeling can easily be applied to ASR and TTS
      • Just by changing the delay, either task can be expressed naturally (see the sketch below)
        • For ASR, the model listens to the audio to be transcribed before emitting the text
        • For TTS, it reads the text to be spoken before emitting the audio
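A minimal sketch of the delay idea above: which stream waits determines the task. The delay value and the padding convention are illustrative, not the settings used by Kyutai:

```python
DELAY = 2  # offset, in tokens, between the two streams (illustrative)

def align_streams(audio_tokens, text_tokens, task):
    """Return (audio_stream, text_stream) with one stream delayed."""
    if task == "ASR":
        # Text waits: it only starts after DELAY audio tokens have been heard.
        return audio_tokens, [None] * DELAY + text_tokens
    if task == "TTS":
        # Audio waits: it only starts after DELAY text tokens have been read.
        return [None] * DELAY + audio_tokens, text_tokens
    raise ValueError(task)
```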
  14. Evaluation experiments (p. 49)
      • Evaluation items
        • Helium's capability as an LLM
        • Speech tokenization
        • Capability as a speech LM
        • Spoken QA
        • Dialogue generation quality
        • Streaming ASR and TTS
        • Quantization
  15. Summary (p. 56)
      • Proposes Moshi, a full-duplex speech dialogue model
        • Built from the neural audio codec Mimi and the RQ-Transformer
      • Multi-stream modeling processes user audio, model audio, and text simultaneously
        • Making the whole stack causal enables streaming
      [Figure: Mimi Encoder → RQ-Transformer (Temporal Transformer initialized from Helium + Depth Transformer) → Mimi Decoder]