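[Figure (slide): "Real-time neural TTS using CPUs". LPCNet @ […] kHz → Total RTF: […]. Results of real-time factors (RTFs). Source: K. Matsubara, T. Okamoto, R. Takashima, T. Takiguchi, T. Toda, Y. Shiga, and H. Kawai, "Investigation of training data size for real-time neural vocoders on CPUs," Acoust. Sci. Tech. (accepted, to appear).]

[…] results have been obtained, and AlignTTS […] uses a separate network […]. Therefore, […] except for Vanilla JDI-T, all models use the duration estimation network of (4) […]. In addition, for comparison, the autoregressive models […] context-label-input Tacotron […], […]er, and BLSTM+Taco2dec [6] are introduced. […]o2dec […] five types of phoneme alignment […].

[…] 23,828 utterances (18 hours) of speech were used as the training set […] as the test set, and the sampling frequency […]. The mel-spectrograms […] as in [5, 6] […], with a frame shift of 12.5 ms. […] context labels consisted of 39 dimensions for phonemes and […] dimensions for accent […], 48 dimensions in total. The estimated mel-spectrogram is converted into a speech waveform by a neural […]

Table 1  Results of inference real-time factors (RTFs) of neural network models with an NVIDIA Tesla V100 and Intel Xeon 6152 CPUs. "FFT" denotes feed-forward Transformer.

Model                 GPU      CPUs
BLSTM+Taco2dec        0.015    0.21
Tacotron 2            0.063    0.22
Transformer           0.55     3.2
FFT (AlignTTS)        0.005    0.026
FFT (JDI-T)           0.005    0.026
FFT duration model    0.0007   0.0024
WaveGlow              0.066    2.1

[…] native Japanese speakers and, as in [5, 6], […]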
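As an illustration of how the RTFs in Table 1 can be read, the following is a minimal Python sketch, not the authors' measurement code. It assumes the usual definition of RTF as processing time divided by the duration of the generated audio, and assumes that the stages of a pipeline run sequentially on the same utterance, so that the total RTF is the sum of the stage RTFs. The function names (`real_time_factor`, `measure_rtf`, `total_rtf`, `frames_to_seconds`) are hypothetical, and the pairing of an acoustic model with WaveGlow is chosen only for the example.

```python
import time

FRAME_SHIFT_SECONDS = 0.0125  # 12.5 ms frame shift, as stated in the text


def frames_to_seconds(n_frames, frame_shift=FRAME_SHIFT_SECONDS):
    """Duration of audio covered by n_frames mel-spectrogram frames."""
    return n_frames * frame_shift


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration (below 1.0 is faster than real time)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(synthesize, audio_seconds):
    """Time a hypothetical zero-argument callable and return its RTF."""
    start = time.perf_counter()
    synthesize()
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, audio_seconds)


def total_rtf(*stage_rtfs):
    """Total RTF of sequential stages processing the same utterance (assumption)."""
    return sum(stage_rtfs)


# RTF values copied from Table 1 as (GPU, CPUs).
TABLE_1 = {
    "BLSTM+Taco2dec": (0.015, 0.21),
    "Tacotron 2": (0.063, 0.22),
    "Transformer": (0.55, 3.2),
    "FFT (AlignTTS)": (0.005, 0.026),
    "FFT (JDI-T)": (0.005, 0.026),
    "FFT duration model": (0.0007, 0.0024),
    "WaveGlow": (0.066, 2.1),
}

# Illustrative pairing only: acoustic model followed by the WaveGlow vocoder.
gpu_total = total_rtf(TABLE_1["FFT (JDI-T)"][0], TABLE_1["WaveGlow"][0])
cpu_total = total_rtf(TABLE_1["FFT (JDI-T)"][1], TABLE_1["WaveGlow"][1])

if __name__ == "__main__":
    print(f"GPU total RTF: {gpu_total:.3f}")
    print(f"CPU total RTF: {cpu_total:.3f}")
    print("Real time on CPUs:", cpu_total < 1.0)
```

Under these assumptions, the summed CPU value for this pairing is dominated by the WaveGlow entry and exceeds 1.0, whereas the acoustic-model entries for the FFT models are far below 1.0 on their own.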