
Keiichi Tokuda
December 08, 2025

Thirty Years of Progress in Speech Synthesis: A Personal Perspective on the Past, Present, and Future

Speech synthesis is a technology that generates speech waveforms corresponding to input text, and it has been a subject of sustained research for many years. The fundamental framework of statistical speech synthesis is to generate speech for any given input text based on a large database of paired speech waveforms and corresponding text. This generation process is generally divided into multiple stages (text analysis, acoustic modeling, and waveform generation), each of which is modeled using statistical machine learning techniques, and more recently, deep learning methods. In this talk, I will provide an overview of the progress made in statistical approaches to speech synthesis over the past few decades, incorporating personal episodes and even some failure stories from my own experience. I will also discuss ongoing challenges related to speech quality, controllability, and application diversity, and present my personal perspective on future directions in the field.


Transcript

  1. Thirty Years of Progress in Speech Synthesis: A Personal Perspective on the Past, Present, and Future

    Keiichi Tokuda, Nagoya Institute of Technology
    December 5, 2025, Symposium on Speech & Behavior Informatics
  2. Challenges in speech synthesis

    • Converting text into speech: Text-to-Speech (TTS)
    • Realizing machines that speak like humans, with capabilities such as:
      • Voice characteristics of arbitrary speakers
      • Various speaking styles (e.g., reading style, conversational style)
      • Emotional expression (e.g., joyful, sad)
      • Emphasis on specific words
      • Timing control and fillers
      • Other types of nonverbal information
      • And in any language!
      • Furthermore, even singing and rap performances!!
  3. A brief history of speech synthesis

    • Rule-based Speech Synthesis (–1980s): methods based on human expert knowledge
    • Concatenative Speech Synthesis (1990s): data-driven methods based on waveform concatenation
    • Statistical Speech Synthesis (mid-1990s–): machine-learning-based methods, later accelerated by AI technologies (Deep Learning); we have been working on this
  4. Rule-based / formant speech synthesis

    • Approaches based on hand-crafted rules and formant synthesizers
    • KlattTalk: formant synthesizer [Klatt, JASA 1980]
    • MITalk: text-to-speech system [Allen+, Cambridge University Press 1987]
    • DECtalk: commercial product [Digital Equipment Corp. 1984]
    • DECtalk demo: Wikimedia Commons (CC BY-SA 3.0); "Daisy Bell" sung by DECtalk: Wikipedia (CC0 1.0)
  5. Concatenative synthesis (fixed-unit)

    • Diphone synthesis
      • PSOLA [Moulines+, SPECOM 1990]
      • MBROLA [Dutoit+, ICSLP 1996]
    • Parametric concatenation
      • Units stored as acoustic parameters (LPC, cepstrum, F0, etc.)
      • Connected / interpolated in the parameter domain, and resynthesized with a vocoder [Imai, IECE-JA 1978]
    • Prosody modeling
      • Fujisaki model (Japanese prosody generation) [Fujisaki+, IEICE-A 1993]
    (Figure: stored units concatenated into synthesized speech)
  6. Concatenative synthesis (unit selection)

    • Automatically selects and concatenates waveform segments from a large database of recorded speech
    • How well does a unit match the target? (target cost)
    • How smoothly are adjacent units connected? (concatenation cost)
    • Minimizes the total cost at runtime using dynamic programming (a sketch follows below)
    • (Audio examples: works well / goes wrong)
    • ν-talk [Sagisaka+, ATR, ICSLP 1992]
    • CHATR [Hunt+, ATR, ICASSP 1996]
    • Festival [Black+, Edinburgh, SPECOM 1998]
    • NextGen [Syrdal+, AT&T, ICSLP 2000]
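The cost minimization above can be made concrete with a small dynamic-programming (Viterbi-style) search. This is only a minimal sketch: the feature vectors, candidate sets, and the simple L1/L2 costs below are illustrative assumptions, not the costs used in the systems cited on the slide.

```python
import numpy as np

# Minimal unit-selection sketch: pick one candidate unit per target position so
# that the sum of target costs and concatenation costs is minimized, via
# dynamic programming over candidate indices.

def target_cost(candidate, target):
    # Illustrative: distance between candidate and target feature vectors.
    return float(np.sum(np.abs(candidate - target)))

def concat_cost(prev_candidate, candidate):
    # Illustrative: mismatch at the join between consecutive units.
    return float(np.sum((prev_candidate - candidate) ** 2))

def select_units(candidates, targets):
    """candidates[t]: (K_t, D) feature vectors of candidate units at position t.
       targets[t]   : (D,) target feature vector at position t.
       Returns the index of the chosen candidate at each position."""
    T = len(targets)
    cost = [np.array([target_cost(c, targets[0]) for c in candidates[0]])]
    back = [None]
    for t in range(1, T):
        cur, bp = [], []
        for c in candidates[t]:
            totals = cost[t - 1] + np.array(
                [concat_cost(p, c) for p in candidates[t - 1]]) + target_cost(c, targets[t])
            cur.append(totals.min())
            bp.append(int(totals.argmin()))
        cost.append(np.array(cur))
        back.append(bp)
    path = [int(cost[-1].argmin())]          # backtrack the minimum-cost path
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```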
  7. Statistical formulation of speech synthesis

    • Overall generative model p(x | w, λ): text w (grapheme level) → speech x (sample level)
    • Decomposed into three stages (written out as one equation below):
      • Text analysis P(l | w, λ_L): text w (grapheme level) → linguistic feature l (phoneme level)
      • Acoustic model p(o | l, λ_A): linguistic feature l (phoneme level) → acoustic feature o (frame level)
      • Vocoder p(x | o): acoustic feature o (frame level) → speech x (sample level)
  8. Hidden Markov model (HMM)-based approach

    • Vocoder: mel-cepstral analysis + MLSA filter [Imai, ICASSP 1983], [Tokuda+, IEICE-JD 1991], [Fukada+, ICASSP 1992]
    • Acoustic model: hidden Markov model (HMM) [Tokuda+, ICASSP 1995], …
    • Same pipeline as before: text analysis P(l | w, λ_L) → acoustic model p(o | l, λ_A) → vocoder p(x | o)
  9. Mel-cepstral analysis / MLSA filter

    • Spectrum represented by the mel-cepstrum c = [c(0), c(1), …, c(M)]^T:
      H(e^{jω}) = exp Σ_{m=0}^{M} c(m) e^{-jω̃m}, with e^{-jω̃} = (e^{-jω} − α) / (1 − α e^{-jω}), z = e^{jω}
      (frequency transformation by a 1st-order all-pass function)
    • Warped frequency ω̃ = β(ω) = tan⁻¹[ (1 − α²) sin ω / ((1 + α²) cos ω − 2α) ] approximates the mel-scale frequency (sketch below)
    • ML estimation of the mel-cepstrum: ĉ = argmax_c p(x | c); when x is a Gaussian process, p(x | c) is convex with respect to c [Tokuda+, IEICE-JD 1991], [Fukada+, ICASSP 1992]
    • H(z) can be implemented by the MLSA filter structure [Imai, ICASSP 1983]
  10. Hidden Markov model (HMM)

    • Left-to-right HMM with state transition probabilities a_ij and state output probabilities b_q(o_t)
    • Observation sequence o = (o_1, o_2, …, o_T) with hidden state sequence q
    • Each state output probability b_q(o_t) is controlled by a regression tree conditioned on linguistic feature l
    (a sketch of evaluating the HMM likelihood follows below)
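As a concrete illustration of a_ij and b_q(o_t), here is a minimal sketch of evaluating the likelihood of an observation sequence with the forward algorithm, using diagonal-covariance Gaussian output distributions. Shapes and parameter values are illustrative assumptions, not details from the slide.

```python
import numpy as np

def gauss_diag(o, mean, var):
    """Diagonal-covariance Gaussian density, playing the role of b_q(o_t)."""
    return np.exp(-0.5 * np.sum((o - mean) ** 2 / var + np.log(2 * np.pi * var)))

def hmm_log_likelihood(obs, trans, means, varis, init):
    """obs: (T, D) observations; trans: (N, N) a_ij; init: (N,) initial state probs."""
    T, N = len(obs), len(init)
    b = np.array([[gauss_diag(o, means[q], varis[q]) for q in range(N)] for o in obs])
    alpha = init * b[0]                                  # forward variable at t = 0
    loglik = np.log(alpha.sum()); alpha /= alpha.sum()   # scaling avoids underflow
    for t in range(1, T):
        alpha = (alpha @ trans) * b[t]
        loglik += np.log(alpha.sum()); alpha /= alpha.sum()
    return loglik
```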
  11. Challenges in HMM-based speech synthesis

    • Conditional independence assumption of output probabilities → parameter generation algorithm (a sketch follows below)
    • Modeling of F0 (fundamental frequency) → multi-space distribution HMM (state output distributions for F0)
    • Multi-stream / full-context clustering → simultaneous modeling of spectrum, F0, and duration
    • Modeling of state duration → hidden semi-Markov model (HSMM)
    [Tokuda+, ICASSP 1995, EUROSPEECH 1995, ICASSP 2000], [Yoshimura+, ICSLP 1998], [Tokuda+, ICASSP 1999], [Yoshimura+, EUROSPEECH 1999]
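To make the parameter generation algorithm concrete, here is a minimal single-dimension sketch: given per-frame means and variances over static and delta features, it solves for the static trajectory that maximizes the likelihood. The delta window (-0.5, 0, 0.5) and the diagonal-covariance simplification are assumptions for illustration.

```python
import numpy as np

# Sketch of ML parameter generation with dynamic features: find the static
# trajectory c maximizing N(Wc; mu, Sigma), i.e. solve
# (W^T Sigma^-1 W) c = W^T Sigma^-1 mu.

def delta_matrix(T):
    """Delta window as a T x T matrix: delta_t = 0.5 * (c_{t+1} - c_{t-1})."""
    D = np.zeros((T, T))
    for t in range(T):
        D[t, max(t - 1, 0)] -= 0.5
        D[t, min(t + 1, T - 1)] += 0.5
    return D

def mlpg(mu, sigma2):
    """mu, sigma2: (2T,) stacked [static; delta] means and variances for one dimension."""
    T = len(mu) // 2
    W = np.vstack([np.eye(T), delta_matrix(T)])   # maps static c -> [c; delta c]
    P = np.diag(1.0 / sigma2)                      # Sigma^{-1} (diagonal assumption)
    A = W.T @ P @ W
    b = W.T @ P @ mu
    return np.linalg.solve(A, b)                   # smooth static trajectory
```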
  12. Trajectory HMM

    • Normalizing the HMM likelihood with respect to the static features gives a proper trajectory model:
      (1 / Z_c) P(o | q, λ̃) = N(c; c̄_q, P_q), where Z_c = ∫ P(o | q, λ̃) dc
    • c̄_q: mean trajectory; P_q: temporal covariance matrix
    (Figure: mean trajectories with vs. without dynamic features for the phone sequence sil-a-i-d-a-sil, plotted over about 55 frames)
    [Tokuda+, EUROSPEECH 2003], [Zen+, CSL 2007]
  13. Structure of state output (observation) vector

    • Spectrum part: spectral parameter (e.g., mel-cepstrum) plus its dynamic (Δ) features
    • Excitation part (e.g., F0): log F0 and voiced/unvoiced flag plus dynamic (Δ) features
    • The dynamic features correspond to the time derivative [Furui, IEEE TASSP 1986]
  14. Structure of state output (observation) vector (continued)

    • Same layout as the previous slide, with the dynamic (Δ) features of both the spectrum and excitation parts highlighted (a small sketch of computing them follows below) [Furui, IEEE TASSP 1986]
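A minimal sketch of appending dynamic features to a static parameter sequence. The central-difference window below is one common choice and is an assumption here; the exact window coefficients are a system-specific detail.

```python
import numpy as np

def append_delta(c):
    """c: (T, D) static features -> (T, 2D) observation vectors [static, delta]."""
    delta = np.zeros_like(c)
    delta[1:-1] = 0.5 * (c[2:] - c[:-2])   # corresponds to the time derivative
    delta[0] = c[1] - c[0]                 # simple one-sided differences at the edges
    delta[-1] = c[-1] - c[-2]
    return np.concatenate([c, delta], axis=1)
```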
  15. Stream-dependent tree-based clustering

    • Separate regression trees for the spectrum parameters, the F0 parameters, and the state duration models (three-dimensional Gaussians)
    • Trained using the EM algorithm
    • Each regression tree is conditioned on linguistic feature l
  16. Flexibility to control speech variations

    • Speaker adaptation (mimicking voices): [Tamura+, ESCA SSW 1998], [Tamura+, ICASSP 2001], [Yamagishi+, ICASSP 2003], …
    • Speaker interpolation (mixing voices): [Yoshimura+, EUROSPEECH 1997], …
    • Eigenvoice (producing voices): [Shichiri+, ICSLP 2002], [Kazumi+, ICASSP 2010], …
    • Cross-lingual (speaking in another language): [Wu+, ISCSLP 2008], [Oura+, ICASSP 2010], …
    (Only from publications by the HTS working group)
  17. My Personal History in Statistical Speech Synthesis

    • –1995: Research on speech spectrum analysis, speech coding, and adaptive filters
    • 1995–: Rise of unit-selection speech synthesis
    • 1995: Proposal of the algorithm for generating parameters from HMMs
    • 1995–1999: Developed a complete system, and enjoyed the journey
    • 2001–2002: Sabbatical at Carnegie Mellon University (global dissemination of English TTS systems)
    • 2002: Release of HTS version 1.0
    • 2002: IEEE Speech Synthesis Workshop (introducing the English system)
    • 2005: Blizzard Challenge started
    • 2005–: Practical applications began to appear (Voice Signal, SVOX, iFLYTEK, ATR, Nuance Communications, KDDI Labs, NTT DOCOMO, Google Android, etc.)
    • 2008–2011: EU FP7 EMIME Project (Edinburgh, Cambridge, Helsinki Tech, IDIAP, Nokia, NITech)
    • 2011–2017: JST CREST uDialogue Project (NITech, Edinburgh, NII)
    • 2013: Real-world deployment: CeVIO Project, JOYSOUND Vocal Assist
    • 2013: Emergence of DNN-based speech synthesis
    • 2014–2015: Sabbatical at Google (DNN-based waveform generation)
    • 2016: WaveNet
  18. Blizzard Challenge 2005

    • Discussions with Prof. Alan Black during my sabbatical stay at CMU (2001–2002):
      • The performance of a TTS system strongly depends on the speech database
      • It is difficult to fairly compare speech synthesis techniques themselves
    • The need for an evaluation campaign for speech synthesis systems using a common dataset
    • Alan: "I've recorded the data. Let's do it." (at the ISCA Speech Synthesis Workshop 2004)
    [Black+, INTERSPEECH 2005]
  19. Improvements introduced through the Blizzard Challenge

    1. Introduction of the hidden semi-Markov model (HSMM): joint training of duration models [Zen+, IEICE-D 2007]
    2. GV-based parameter generation: recovering from the over-smoothing caused by acoustic models [Toda+, INTERSPEECH 2005]
    3. Introduction of STRAIGHT: pitch-synchronous analysis and band aperiodicity measures [Kawahara+, SPECOM 1999]
    Key persons: Tomoki Toda, Heiga Zen
  20. 2024 IEEE James L. Flanagan Speech and Audio Processing Award

    For contributions to statistical speech synthesis and speech signal processing. Recognizing work that laid the foundations for modern neural speech generation.
  21. Introduction of deep neural networks

    • DNN-based speech synthesis [Zen+, ICASSP 2013]
    • LSTM-based speech synthesis [Fan+, INTERSPEECH 2014], etc.
    • In the pipeline, the acoustic model p(o | l, λ_A) is replaced by an FFNN or LSTM mapping linguistic features l to acoustic features o
  22. Sabbatical Leave (2014–2015): one-year stay at Google

    • I decided to pursue risky research, since I was temporarily free from daily responsibilities.
      "Wait… isn't risky research what we are supposed to do at universities?"
    • Abundant resources: Google's computational resources, Google's software tools, my former student Heiga Zen (working at Google), and some of my own time
    • Direct modeling of speech waveforms using neural networks: DNNs were used for acoustic modeling, but quality was still limited by vocoders
  23. Direct Modeling of Speech Waveforms

    • Training neural networks (FFNN, LSTM) to directly maximize the likelihood of speech waveforms, combined with a source-filter model
    • Directly modeling speech waveforms by neural networks [Tokuda+, ICASSP 2015]
    • Directly modeling voiced and unvoiced components by neural networks [Tokuda+, ICASSP 2016]
    • In the pipeline, the vocoder p(x | o) is replaced by direct waveform modeling
  24. Speech signal model

    • Signal model for unvoiced + voiced sounds (a synthesis sketch follows below)
    • Voiced component: v(n) = p(n) ∗ h_v(n), where p(n) is a pulse train and h_v(n) = (1/2π) ∫_{−π}^{π} H_v(e^{jω}) e^{jωn} dω (mixed phase)
    • Unvoiced component: u(n) = e(n) ∗ h_u(n), where e(n) ~ N(0, 1) is white Gaussian noise and h_u(n) = (1/2π) ∫_{−π}^{π} H_u(e^{jω}) e^{jωn} dω (minimum phase)
    • Speech signal: x(n) = v(n) + u(n)
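A minimal sketch of synthesizing a short segment with this two-branch model. The sampling rate, F0, and the exponential impulse responses are placeholder assumptions; in the approach described in the talk, the filters come from the acoustic features predicted by the network.

```python
import numpy as np

# Sketch of the unvoiced + voiced signal model: a pulse train p(n) through a
# voiced filter h_v plus white Gaussian noise e(n) through an unvoiced filter
# h_u, summed into the speech signal x(n).
fs, f0, dur = 16000, 120.0, 0.05
n = np.arange(int(fs * dur))

p = np.zeros(len(n))
p[::int(fs / f0)] = 1.0                      # pulse train p(n) at F0
e = np.random.randn(len(n))                  # white Gaussian excitation e(n)

h_v = np.exp(-np.arange(64) / 8.0)           # placeholder voiced impulse response
h_u = np.exp(-np.arange(64) / 2.0)           # placeholder unvoiced impulse response

v = np.convolve(p, h_v)[:len(n)]             # voiced component v(n) = p(n) * h_v(n)
u = np.convolve(e, h_u)[:len(n)]             # unvoiced component u(n) = e(n) * h_u(n)
x = v + u                                    # speech signal x(n)
```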
  25. Direct Modeling of Speech Waveforms: "WaveNet"

    • The acoustic model and vocoder are merged into a single autoregressive generative model p(x | l, λ_AV), conditioned on the linguistic features l produced by text analysis P(l | w, λ_L)
    • WaveNet: A Generative Model for Raw Audio [van den Oord+, INTERSPEECH 2016]
  26. WaveNet vocoder

    • Neural vocoder (model parameters λ_V): the vocoder becomes a trainable model p(x | o, λ_V) conditioned on acoustic features o
    • WaveNet vocoder [Tamamori+, INTERSPEECH 2017]
  27. WaveNet

    • Autoregressive generative model using a convolutional NN, directly modeling the speech waveform
    • Dilated causal convolution: the waveform is modeled by the CNN, conditioned on acoustic and linguistic features (a sketch follows below)
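A minimal numpy sketch of dilated causal convolution, the core operation named on this slide. Kernel values, layer count, and the tanh nonlinearity are illustrative assumptions; the real WaveNet uses gated activations, residual/skip connections, and conditioning inputs.

```python
import numpy as np

def dilated_causal_conv(x, kernel, dilation):
    """x: (T,) input; kernel: (K,) filter taps.
       Output at time t depends only on x[t], x[t-dilation], ..., x[t-(K-1)*dilation]."""
    K = len(kernel)
    pad = np.concatenate([np.zeros((K - 1) * dilation), x])   # left (causal) padding
    return np.array([np.dot(kernel, pad[t:t + K * dilation:dilation])
                     for t in range(len(x))])

# Stacking layers with dilations 1, 2, 4, ... grows the receptive field
# exponentially: with kernel size 2 and dilations up to 32, it spans 64 samples.
x = np.random.randn(1024)
for dilation in [1, 2, 4, 8, 16, 32]:
    x = np.tanh(dilated_causal_conv(x, np.random.randn(2), dilation))
```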
  28. Speech signal generation model

    • Classical view: an excitation signal e(n) is passed through a long-term predictor P_L(z) (controlled by F0) and a short-term predictor P_S(z) (controlled by the spectrum parameters) to produce the speech signal x(n) (a sketch follows below)
    • WaveNet acts as a non-linear predictor, conditioned on F0 and spectrum parameters, on linguistic features l (initially), or on a mel-spectrogram
    • It works!
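A minimal sketch of the classical cascade described above: all-pole filtering by a long-term (pitch) predictor followed by a short-term (spectral-envelope) predictor. The predictor coefficients and pitch lag are placeholder assumptions chosen only to keep the filters stable.

```python
import numpy as np

def synthesis_filter(e, coefs, lag=1):
    """All-pole filtering: y(n) = e(n) + sum_k coefs[k] * y(n - lag*(k+1))."""
    y = np.zeros(len(e))
    for n in range(len(e)):
        y[n] = e[n] + sum(c * y[n - lag * (k + 1)]
                          for k, c in enumerate(coefs) if n - lag * (k + 1) >= 0)
    return y

e = np.random.randn(2000)                    # excitation signal e(n)
y = synthesis_filter(e, [0.7], lag=133)      # long-term predictor (pitch period of ~133 samples)
x = synthesis_filter(y, [1.2, -0.5], lag=1)  # short-term predictor (spectral envelope)
```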
  29. Famous words in speech technology (1980s)

    "Every time I fire a linguist, the performance of the speech recognizer goes up" (by Frederick Jelinek)
    "Every time I fire a speech synthesis researcher, the performance of the speech synthesizer goes up" (by ????? ?????)
  30. Various Approaches to Neural Waveform Modeling

    • Autoregressive: WaveNet, SampleRNN, WaveRNN, …
    • Normalizing flow: WaveGlow, Parallel WaveNet, ClariNet, FloWaveNet, WaveGrad, …
    • Non-autoregressive (+GAN) (+upsampling): Parallel WaveNet, Parallel WaveGAN, HiFi-GAN
    • Diffusion probabilistic model / flow matching: WaveGrad, PriorGrad, SpecGrad, …
    • Combining with a source-filter model: LPCNet, ExcitNet, GlotNet, LP-WaveNet, …
    • Introducing signal processing techniques: SubbandWaveNet, FFTNet, iSTFTNet, …
    • Non-upsampling: VOCOS, WaveNeXt
    (Roughly organized from excitation-driven to upsampling-based approaches; many models are designed for parallel computation and combined with signal processing techniques)
  31. PeriodNet: neural vocoder based on periodic/aperiodic decomposition [Hono+, ICASSP 2020]

    • ① Periodic component generated from a sinusoidal excitation
    • ② Aperiodic component generated from a noise excitation
    • ③ The two components are summed into the final speech signal
    • Both branches are conditioned on acoustic features (an excitation sketch follows below)
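A minimal sketch of the two excitation signals behind the decomposition above. In PeriodNet each excitation drives a neural waveform generator conditioned on acoustic features; here the generators are replaced by fixed gains purely for illustration, and the F0 contour and mixing weights are assumptions.

```python
import numpy as np

fs = 16000
f0 = np.full(int(0.05 * fs), 120.0)               # illustrative sample-wise F0 contour
phase = 2 * np.pi * np.cumsum(f0) / fs
sine_excitation = np.sin(phase)                    # drives the periodic branch (1)
noise_excitation = np.random.randn(len(f0))        # drives the aperiodic branch (2)
speech = 0.8 * sine_excitation + 0.2 * noise_excitation   # summed output (3)
```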
  32. Modeling temporal structure (omitted in this talk)

    • Explicit duration models (FastSpeech-type, diffusion-based models)
    • Attention-based models (Tacotron-type)
    • Monotonic alignment search (e.g., VITS)
    • HMM / HSMM-based models (e.g., deep-HSMM)
    • This is the part of the pipeline that maps phoneme-level linguistic features to frame-level acoustic features
  33. Large-scale pretrained models (+α)

    • wav2vec 2.0, HuBERT, Spin (content representation)
    • SoundStream, EnCodec (neural audio codec / discretization)
    • ECAPA-TDNN (speaker embedding)
    • BigVGAN (neural vocoder)
    • VALL-E (speech generation)
    • Whisper (speech recognition / representation learning)
    • BERT / RoBERTa (text representation)
    • …
  34. Other important technical issues (not covered today)

    • Text analysis
    • Shared / common datasets
    • Text normalization
    • Voice conversion / speech conversion
    • Physical simulation of sound production
    • Increasing complexity of user interfaces
  35. Societal and Ethical Issues

    • Welfare and healthcare applications: visual impairment / speech disorders
    • Overcoming language barriers: cross-lingual dubbing (voice-preserving)
    • Detection of fake / synthetic speech
    • Training with unauthorized data
    • Relationship between voice professionals and speech technology
  36. Summary

    • We will probably continue to experience moments like "Wow, it really works!" and "I never imagined it could be used this way."
    • The tension between "explicit knowledge about speech and language" and "the power of data" will remain a key driving force.
    • What makes speech truly fascinating is its deep connection to human perception and emotion.
    • The social impact and importance of speech technology will continue to grow.
    "Is speech research ever ending?"
  37. Special thanks

    • Supervisors: Satoshi Imai, Tadashi Kitamura, Takao Kobayashi
    • Colleagues and students: Takashi Masuko, Noboru Miyazaki, Takayoshi Yoshimura, Shinji Sako, Masatsune Tamura, Junichi Yamagishi, Tomoki Toda, Heiga Zen, Kazuhito Koishida, Tetsuya Yamada, Nobuaki Mizutani, Ryuta Terashima, Akinobu Lee, Keiichiro Oura, Keijiro Saino, Kenichi Nakamura, Yi-Jian Wu, Ling-Hui Chen, Shifeng Pan, Yoshihiko Nankaku, Ranniery Maia, Sayaka Shiota, Chiyomi Miyajima, Kei Hashimoto, Shinji Takaki, Kazuhiro Nakamura, Kei Sawada, Takenori Yoshimura, Daisuke Yamamoto, Yukiya Hono, Takato Fujimoto, …
    • Collaborators, mentors, and research colleagues in the speech research community: Junichi Takami, Naoto Iwahashi, Mike Schuster, Satoshi Nakamura, Frank Soong, Michael Picheny, Simon King, Steve Young, Mari Ostendorf, Alan Black, Alex Acero, Bill Byrne, Phil Woodland, Thomas Hain, Phil Garner, Masataka Goto, Shigeru Katagiri, Sadaoki Furui, Hideki Kenmochi, Kazuya Takeda, Tatsuya Kawahara, Seiichi Nakagawa, Keikichi Hirose, Tetsunori Kobayashi, Mikko Kurimo, Shigeki Sagayama, Kiyohiro Shikano, Hisashi Kawai, Nobuyuki Nishizawa, Minoru Tsuzaki, Yoichi Yamashita, Nobuaki Minematsu, Mat Shannon, Mark Gales, Kai Yu, John Dines, …
    Listed in no particular order. My apologies if I have missed anyone.