Takuma OKAMOTO
September 11, 2020
[ASJ2020a] Comparison of FastSpeech-based neural TTS models with full-context label input
Transcript
Comparison of FastSpeech-based neural TTS models with full-context label input
○Takuma Okamoto, Tomoki Toda, Yoshinori Shiga, Hisashi Kawai
National Institute of Information and Communications Technology (NICT), Nagoya University
ASJ Annual Meeting (Autumn), September
Outline
  Introduction
  Purpose
  FastSpeech-based neural TTS models with full-context label input
    AlignTTS
    JDI-T
  Experiments
  Conclusions
  Announcement
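Introduction

End-to-end neural text-to-speech (TTS)
  Sequence-to-sequence (seq2seq) models: Tacotron 2, Transformer, Deep Voice
    Generate acoustic features from text (or phonemes); no phoneme alignment is required.
    Issues: attention prediction failures cause skipped or repeated phonemes, which is fatal for production services; the models are autoregressive.
  State-of-the-art fully end-to-end TTS: EATS, FastSpeech 2
    Generate the speech waveform from phonemes with a single network; no phoneme alignment is required; non-autoregressive.
    Issues: speech quality still has room for improvement; if fundamental frequency is used as an intermediate feature, can the model really be called end-to-end?

Stable neural TTS models
  BLSTM+Taco2dec (T. Okamoto et al., ASRU)
    Uses forced alignment estimated by an HMM together with the Tacotron 2 decoder → stable and high quality.
    Issues: phoneme alignment must be trained separately with an HMM; the model is autoregressive.
  FastSpeech (Y. Ren et al., NeurIPS)
    Feed-forward Transformer, non-autoregressive (fast generation); phoneme alignment comes from a teacher Transformer.
    Knowledge distillation (teacher-student training) gives stable generation with quality comparable to the autoregressive Transformer.
    Issues: knowledge distillation is required; the results may have plateaued because of the LJSpeech corpus.

[Table: real-time factors on GPU and CPUs; values not recoverable from the transcript]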
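FastSpeech-type models replace attention with explicit phoneme durations: the "Length Regulator" block that appears in the architecture diagrams of this deck repeats each phoneme-level hidden vector for as many frames as its duration. The sketch below illustrates only that expansion step in plain Python. The function name `length_regulate` and the toy vectors are assumptions made for illustration, not code from the talk.

```python
from typing import List, Sequence


def length_regulate(hidden: Sequence[Sequence[float]],
                    durations: Sequence[int]) -> List[List[float]]:
    """Expand phoneme-level vectors to frame level by repetition.

    hidden[i] is the encoder output for phoneme i and durations[i] is
    the number of output frames assigned to that phoneme.
    """
    if len(hidden) != len(durations):
        raise ValueError("one duration per phoneme is required")
    frames: List[List[float]] = []
    for vec, dur in zip(hidden, durations):
        if dur < 0:
            raise ValueError("durations must be non-negative")
        # Copy the vector so that frames do not share list objects.
        frames.extend(list(vec) for _ in range(dur))
    return frames


# Toy input (hypothetical values): three phonemes, two channels each.
phoneme_states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
frames = length_regulate(phoneme_states, [2, 0, 3])
print(len(frames))  # 5
```

Because the output length is fixed by the durations before decoding starts, all frames can be generated in parallel, which is what makes the non-autoregressive decoder fast.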
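Previous results

Japanese neural TTS with full-context labels
  Full-context label input gives higher quality than phoneme-only input.
  FastSpeech without knowledge distillation falls short of autoregressive models.
  For FastSpeech, estimating phoneme durations with a separate network is more accurate.
Demo samples: https://ast-astrec.nict.go.jp/demo_samples/icassp_…_okamoto/index.html

[Figure: mean opinion scores for (a) phoneme-only input and (b) full-context label input. Acoustic models: BLSTM+Taco2dec, Tacotron 2, Transformer, FastSpeech. Vocoders: WaveGlow, WG(256), PWG. References: Original, STRAIGHT, analysis-synthesis.]
Source: Okamoto et al., ASJ Spring Meeting (cancelled due to COVID-19)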
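Purpose

FastSpeech-based neural TTS models that need no knowledge distillation
  AlignTTS (Z. Zeng et al., ICASSP): estimates phoneme alignment with a mixture density network; phoneme durations are estimated by a separate network; character input (English)
  JDI-T (Jointly trained Duration Informed Transformer; D. Lim et al.): trains an autoregressive Transformer and FastSpeech jointly; phoneme input (Korean)
  FastSpeech 2 (Y. Ren et al.): uses the Montreal Forced Aligner (MFA) for phoneme alignment; uses fundamental frequency as an intermediate feature; phoneme input (English)
  FastPitch (A. Łańcucki): uses Tacotron 2 estimates for phoneme alignment; uses fundamental frequency as an intermediate feature; phoneme input (English)

Goal: compare FastSpeech-based Japanese TTS models with full-context label input
  Compare five phoneme alignment methods: HMM, MFA, Tacotron 2, AlignTTS, JDI-T
    → because each model uses a different alignment method
  Compare the FastSpeech model structures (AlignTTS, JDI-T)
    → because AlignTTS and JDI-T use different FastSpeech (feed-forward Transformer) structures

Can fast-generation models for Japanese neural TTS reach higher speech quality?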
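AlignTTS

[Figure: AlignTTS architecture with full-context label input. (1) FFT block: multi-head attention, add & norm, 1D convolution, add & norm. (2) Feed-forward Transformer: 1 × 1 convolution layer, positional encoding, N FFT blocks, length regulator, positional encoding, N FFT blocks, linear layer, mel-spectrogram. (3) Mixture density network (training only): linear layers output {(µ_j, Σ_j)}_m, the forward algorithm runs over p(y_i | µ_j, Σ_j) for the frames {y_i}_n, and the alignment loss is −log α_{n,m}. (4) Duration predictor: 1 × 1 convolution layer, positional encoding, N FFT blocks, linear layer, duration sequence.]

Training procedure (original)
  Step 1: train the encoder and the mixture density network
    The mixture density network learns the mean and variance of a multivariate Gaussian for each phoneme.
      Mean → the average mel-spectrogram of phoneme "x"
      Variance → the spread of the spectrogram of phoneme "x"
    The Viterbi algorithm then yields the phoneme alignment.
  Step 2: train the decoder
    The encoder is fixed and only the decoder is trained.
  Step 3: joint optimization
    FastSpeech and the mixture density network are trained jointly; the phoneme alignment is also updated during training.
  Step 4: train the phoneme duration predictor with the final phoneme alignment.
For details of the training procedure, see the material from the ICASSP acoustics and speech paper-reading session: https://bit.ly/…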
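The Viterbi step of the AlignTTS procedure can be illustrated with a small dynamic program. This is a minimal sketch, assuming the frame-by-phoneme log-likelihoods log p(y_i | µ_j, Σ_j) have already been computed by the mixture density network. The function name `viterbi_durations` and the toy matrix are hypothetical, and the sketch assumes a strictly monotonic alignment in which every phoneme receives at least one frame.

```python
import math
from typing import List


def viterbi_durations(log_prob: List[List[float]]) -> List[int]:
    """Most likely monotonic frame-to-phoneme alignment, as durations.

    log_prob[i][j] is the log-likelihood of frame i under phoneme j.
    The path starts at phoneme 0, ends at the last phoneme, and moves
    forward by at most one phoneme per frame.
    """
    n, m = len(log_prob), len(log_prob[0])
    if n < m:
        raise ValueError("need at least one frame per phoneme")
    neg_inf = -math.inf
    best = [[neg_inf] * m for _ in range(n)]
    back = [[0] * m for _ in range(n)]
    best[0][0] = log_prob[0][0]
    for i in range(1, n):
        for j in range(min(i, m - 1) + 1):
            stay = best[i - 1][j]
            move = best[i - 1][j - 1] if j > 0 else neg_inf
            if move > stay:
                best[i][j], back[i][j] = move + log_prob[i][j], j - 1
            else:
                best[i][j], back[i][j] = stay + log_prob[i][j], j
    # Trace back from the last frame and last phoneme, counting frames.
    durations = [0] * m
    j = m - 1
    for i in range(n - 1, -1, -1):
        durations[j] += 1
        j = back[i][j]
    return durations


# Toy log-likelihoods (hypothetical): 5 frames, 2 phonemes.
toy = [[-1, -5], [-1, -5], [-1, -5], [-5, -1], [-5, -1]]
print(viterbi_durations(toy))  # [3, 2]
```

The resulting durations are what a separate duration predictor would be trained to reproduce in the final training step.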
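JDI-T

Training
  FastSpeech and the decoder of an autoregressive Transformer are optimized jointly.
  L1 loss: acoustic feature estimation
  L2 loss: phoneme duration estimation
  Encouraging a diagonal attention matrix
    CTC loss: re-estimates the phoneme sequence from the decoder output
    Guided attention loss
Synthesis
  Only FastSpeech is used for inference.

[Figure: JDI-T architecture with full-context label input. Shared path: 1 × 1 convolution layer, encoder pre-net, positional encoding, N FFT blocks. FastSpeech path: duration predictor, length regulator, positional encoding, N FFT blocks, linear layer, mel-spectrogram. Training-only path: attention mechanism, decoder pre-net, positional encoding, decoder, linear layers. Losses: L1 loss on both mel-spectrogram outputs, L2 loss on durations, CTC loss, guided attention loss.]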
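A guided attention loss penalizes attention mass that lies far from the diagonal. The sketch below uses the commonly cited weighting W[t][i] = 1 − exp(−(i/N − t/T)² / (2g²)) and averages attention × weight. Whether the talk's implementation uses exactly this form and the value g = 0.2 is an assumption, and the function names are hypothetical.

```python
import math
from typing import List


def guided_attention_weights(n_in: int, n_out: int,
                             g: float = 0.2) -> List[List[float]]:
    """Penalty matrix: near 0 on the diagonal, approaching 1 far from it."""
    return [
        [1.0 - math.exp(-((i / n_in - t / n_out) ** 2) / (2.0 * g * g))
         for i in range(n_in)]
        for t in range(n_out)
    ]


def guided_attention_loss(attn: List[List[float]], g: float = 0.2) -> float:
    """Mean of attention weights multiplied by the penalty matrix.

    attn[t][i] is the attention of output frame t on input phoneme i.
    """
    n_out, n_in = len(attn), len(attn[0])
    w = guided_attention_weights(n_in, n_out, g)
    total = sum(attn[t][i] * w[t][i]
                for t in range(n_out) for i in range(n_in))
    return total / (n_out * n_in)


# Toy attention maps (hypothetical): perfectly diagonal vs. reversed.
diagonal = [[1.0 if i == t else 0.0 for i in range(4)] for t in range(4)]
reversed_map = [[1.0 if i == 3 - t else 0.0 for i in range(4)]
                for t in range(4)]
print(guided_attention_loss(diagonal))                      # 0.0
print(guided_attention_loss(reversed_map) > 0.0)            # True
```

A diagonal attention map incurs no penalty, so minimizing this term pushes the autoregressive decoder toward a monotonic alignment from which phoneme durations can be read off.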
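Experimental conditions

Acoustic models
  Comparison of FastSpeech-based models (feed-forward Transformer: FFT)
    AlignTTS-type or JDI-T-type model: they differ in the number of channels, the FFT block structure, and the total number of FFT blocks.
    * Only the FastSpeech part of each model is trained on its own, without phoneme duration estimation.
  Vanilla JDI-T: includes the autoregressive Transformer decoder and phoneme duration estimation.
Phoneme alignment methods
  HMM (HTS, Merlin)
  Montreal Forced Aligner (MFA): GMM, LDA (Kaldi)
  AlignTTS: alignment obtained from the step 1 training only, without the joint optimization of step 3
  JDI-T: only the FastSpeech encoder and the autoregressive Transformer decoder are trained
  Tacotron 2: alignment taken from the attention weights when running inference on the training set after training
  → For each alignment, an FFT-based phoneme duration predictor is trained separately (for FFT and BLSTM+Taco2dec).
Training data: one Japanese female professional speaker, 23,828 utterances (about 18 hours), … kHz
Full-context labels: 39 phoneme dimensions plus accent information dimensions, 48 dimensions in total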
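One simple way to turn attention weights from an autoregressive model into phoneme durations is to assign each output frame to the phoneme with the largest attention weight and count frames per phoneme. The sketch below shows that counting rule only. It is an illustrative assumption about how such an extraction can be done, not a description of the exact procedure used in the talk, and the attention values are toy numbers.

```python
from typing import List


def durations_from_attention(attn: List[List[float]],
                             n_phonemes: int) -> List[int]:
    """Count, for each phoneme, the frames that attend to it most.

    attn[t][j] is the attention weight of output frame t on phoneme j.
    The returned durations always sum to the number of frames.
    """
    durations = [0] * n_phonemes
    for frame in attn:
        # Index of the phoneme with the largest weight for this frame.
        best = max(range(n_phonemes), key=lambda j: frame[j])
        durations[best] += 1
    return durations


# Toy attention map (hypothetical): 4 frames over 3 phonemes.
toy_attn = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.7, 0.2],
    [0.0, 0.2, 0.8],
]
print(durations_from_attention(toy_attn, 3))  # [2, 1, 1]
```

This rule gives usable durations only when the attention map is close to monotonic, which is one reason the quality of the source model's alignment matters when comparing alignment methods.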
Results of real-time factors (RTFs)

Measurement conditions: PyTorch implementation
  GPU: NVIDIA Tesla V100
  CPU: Intel Xeon 6152
    Acoustic models: up to … cores
    WaveGlow: … cores (the maximum of one node)
Results
  FastSpeech-based models are faster than autoregressive models.
  On CPUs as well, the FFT acoustic models run at an RTF of about 0.026 (Table 1).

Table 1. Results of inference real-time factors (RTFs) of neural network models with an NVIDIA Tesla V100 and Intel Xeon 6152 CPUs. "FFT" denotes feed-forward Transformer.

  Model                 GPU      CPUs
  BLSTM+Taco2dec        0.015    0.21
  Tacotron 2            0.063    0.22
  Transformer           0.55     3.2
  FFT (AlignTTS)        0.005    0.026
  FFT (JDI-T)           0.005    0.026
  FFT duration model    0.0007   0.0024
  WaveGlow              0.066    2.1

[Cropped excerpt from the paper. Recoverable details: all models except Vanilla JDI-T use the duration prediction network (4); autoregressive models with full-context label input (Tacotron 2, Transformer, BLSTM+Taco2dec [6]) are included for comparison; BLSTM+Taco2dec is evaluated with the five phoneme alignment methods; 23,828 utterances (18 hours) form the training set; the frame shift is 12.5 ms; the labels have 39 phoneme dimensions plus accent dimensions, 48 in total; a neural vocoder converts the estimated mel-spectrograms to waveforms.]

Real-time neural TTS using CPUs
  LPCNet at … kHz → total RTF: …
  K. Matsubara, T. Okamoto, R. Takashima, T. Takiguchi, T. Toda, Y. Shiga, and H. Kawai, "Investigation of training data size for real-time neural vocoders on CPUs," Acoust. Sci. Tech. (accepted, to appear).
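A real-time factor is the processing time divided by the duration of the generated audio, so a value below 1 means faster than real time. The sketch below shows a generic way to measure it and then adds up the CPU values from Table 1 for one pipeline. It assumes the stages run one after another and that each RTF was measured against the same audio duration. The helper name `real_time_factor` and the choice of pipeline are illustrative assumptions.

```python
import time
from typing import Callable


def real_time_factor(synthesize: Callable[[], None],
                     audio_seconds: float) -> float:
    """Elapsed wall-clock time of synthesize() per second of audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    start = time.perf_counter()
    synthesize()
    return (time.perf_counter() - start) / audio_seconds


# CPU RTFs copied from Table 1 for one possible pipeline.
cpu_rtf = {
    "FFT (AlignTTS)": 0.026,
    "FFT duration model": 0.0024,
    "WaveGlow": 2.1,
}
# Sequential stages: processing times add, so their RTFs add as well.
total_cpu_rtf = sum(cpu_rtf.values())
print(round(total_cpu_rtf, 4))   # 2.1284
print(total_cpu_rtf < 1.0)       # False
```

Under these assumptions the acoustic model and the duration model are a small part of the CPU total, and the vocoder is the stage that keeps the pipeline above real time.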
MOS results and demo samples

[Figure: mean opinion scores. Non-autoregressive acoustic models FFT (AlignTTS) and FFT (JDI-T) and the autoregressive BLSTM+Taco2dec, each with HMM, MFA, Tacotron 2, AlignTTS, and JDI-T alignments; autoregressive seq2seq models Tacotron 2 and Transformer; Vanilla JDI-T; WaveGlow (analysis-synthesis); Original.]

FastSpeech-based models fall short of autoregressive models.
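A mean opinion score is the average of listener ratings for a system, usually reported with a confidence interval so that systems can be compared. The sketch below computes the mean and a 95% interval half-width under a normal approximation. The ratings are made up for illustration and are not results from this study, and the helper name `mos_with_ci` is hypothetical.

```python
import math
import statistics
from typing import List, Tuple


def mos_with_ci(ratings: List[float], z: float = 1.96) -> Tuple[float, float]:
    """Return (mean, half-width of the confidence interval).

    Uses the sample standard deviation and a normal approximation,
    which is a simplification for small numbers of ratings.
    """
    if len(ratings) < 2:
        raise ValueError("at least two ratings are required")
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Made-up ratings on a 1-to-5 scale, for illustration only.
example_ratings = [4.0, 5.0, 3.0, 4.0, 4.0]
mos, half = mos_with_ci(example_ratings)
print(mos, round(half, 2))  # 4.0 0.62
```

Overlapping intervals are only a rough guide. A proper comparison between systems would use a significance test on the paired ratings.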
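Discussions and conclusions

Results and discussion
  FastSpeech-based models (non-autoregressive)
    Speech quality falls short of autoregressive models.
    Autoregression over acoustic features matters → quality could be improved with auxiliary information such as fundamental frequency (FastSpeech 2, FastPitch).
    No difference between the FFT structures (AlignTTS or JDI-T).
  Alignment methods
    No significant difference except for AlignTTS → the JDI-T phoneme alignment is good.
    AlignTTS was used with full-context labels → the difference in labels may have hurt the alignment → to be examined with phoneme-only input.

Conclusions
  Compared FastSpeech-based neural TTS models with full-context label input.
  Compared five phoneme alignment methods for the AlignTTS-type and JDI-T-type models.
  Confirmed fast generation using CPUs.
  Speech quality falls short of autoregressive models (Tacotron 2, Transformer, BLSTM+Taco2dec).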
Announcement

Keihanna R&D Fair (online), … to … (Fri.)
  Introduction of neural speech-rate conversion technology
  Okamoto et al., "An attempt at neural speech-rate conversion using a multi-speaker WaveNet vocoder," SP research meeting, March (cancelled due to COVID-19)

[Figure: WaveNet vocoder for speech-rate conversion. Residual blocks (2 × 1 dilated CNN, tanh and σ gating, 1 × 1 CNN) with skip connections, ReLU, 1 × 1 CNN, and softmax output p(x_n | x_0, …, x_{n−1}); mel-spectrogram conditioning through an upsample layer and a bidirectional GRU, with resampling for rate conversion in variants (a) and (b). Mean opinion score versus speech-rate conversion rate for STRAIGHT, SD-WaveNet, SI-WaveNet (a), (b), and WSOLA. Trained using the JVS corpus. Acoustic features are resampled for speech-rate conversion.]