using Nonequilibrium Thermodynamics," ICML 2015. [2] J. Ho+, "Denoising Diffusion Probabilistic Models," NeurIPS 2020. [3] A. Nichol+, "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models," arXiv:2112.10741, 2021. [4] A. Ramesh+, "Hierarchical Text-Conditional Image Generation with CLIP Latents," https://cdn.openai.com/papers/dall-e-2.pdf, 2022. [5] N. Chen+, “WaveGrad: Estimating Gradients for Waveform Generation,” ICLR, 2021. [6] Z. Kong+, “DiffWave: A Versatile Diffusion Model for Audio Synthesis,” ICLR, 2021. [7] D. P. Kingma+, "Variational Diffusion Models," NeurIPS, 2021. [8] S. Lee+, "PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Dependent Adaptive Prior," ICLR, 2022. [9] Y. Koizumi+, "SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping," Interspeech, 2022. [10] T. Kusano+, "Designing Nearly Tight Window for Improving Time-Frequency Masking," ICA, 2019. [11] W. A. Jassim+, "WARP-Q: Quality Prediction for Generative Neural Speech Codecs," ICASSP, 2021 [12] T. Okamoto+, "Noise Level Limited Sub-Modeling for Diffusion Probabilistic Vocoders," ICASSP, 2021 [13] S. Maiti+, "Parametric Resynthesis with Neural Vocoders," WASPAA, 2019 [14] Y. Koizumi+, "DF-Conformer: Integrated Architecture of Conv-TasNet and Conformer using Linear Complexity Self-Attention for Speech Enhancement," WASPAA, 2021 [15] S. Wang+, "A Curated Dataset of Urban Scenes for Audio-Visual Scene Analysis," ICASSP, 2021 [16] J. Jensen+, "An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers," IEEE TASLP, 2016.