evaluations. The values in the columns for mel-cepstral distortion (MCD) and log fo root-mean-square error (RMSE) are the means and standard deviations. AM and NV represent acoustic model and neural vocoder, respectively.

| AM | NV | Input fo shift | Acoustic feature | TTS MCD [dB] | TTS log fo RMSE | SVS MCD [dB] | SVS log fo RMSE |
|---|---|---|---|---|---|---|---|
| CFS2 | HiFi-GAN | | mel-spectrogram | 6.20 ± 0.49 | 0.22 ± 0.06 | 11.4 ± 0.83 | 0.41 ± 0.14 |
| CFS2 | HiFi-GAN | | WORLD | 6.39 ± 0.51 | 0.22 ± 0.06 | 12.0 ± 1.08 | 0.41 ± 0.10 |
| PESC | HiFi-GAN | | WORLD | 6.35 ± 0.51 | 0.22 ± 0.06 | 11.5 ± 1.17 | 0.37 ± 0.09 |
| PESC | HiFi-GAN | √ | WORLD | - | - | 10.9 ± 0.92 | 0.57 ± 0.15 |
| CFS2 | FIRNet | | WORLD | 6.41 ± 0.50 | 0.23 ± 0.06 | 12.2 ± 1.02 | 0.35 ± 0.10 |
| PESC | FIRNet | | WORLD | 6.39 ± 0.49 | 0.23 ± 0.06 | 11.6 ± 1.18 | 0.34 ± 0.09 |
| PESC | FIRNet | √ | WORLD | - | - | 10.5 ± 1.00 | 0.30 ± 0.11 |

els. The JSUT-Song corpus was used only for evaluating the SVS model. Following the procedure established for ESPnet2-TTS [8], 4,500 utterances, 250 utterances, and 250 utterances were used for the training set, validation set, and test set for TTS, respectively. For TTS in Japanese, the G2P function based on pyopenjtalk and enhanced with prosody symbols [34] was used, following [8, 11]. For the subjective evaluations of TTS, 10 utterances were randomly selected. For the objective and subjective evaluations of SVS, the first phrases of 10 songs from

lated by the ESPnet2-TTS toolkit [8, 9]. To evaluate the synthesized TTS and SVS speech subjectively, mean opinion score (MOS) tests [38] were conducted for (a) normal TTS, (b) normal SVS, (c) T × 0.5 SVS (240 beats per minute), (d) fo × 0.5 SVS (−1 octave), and (e) fo × 2.0 SVS (+1 octave). In (d) and (e), fo was controlled before the neural vocoders. Each subject evaluated 200 samples: 10 utterances × 20 models. The naturalness of each sample was rated on a five-point scale: (1) bad, (2) poor, (3) fair, (4) good, and (5) excellent. Twenty adult