Generative models: Variational Autoencoders vs. Gaussian Mixture Models for voice anti-spoofing
In this study, Bhusan Chettri explores the feasibility of generative models such as the Variational Autoencoder (VAE) for voice anti-spoofing, and compares their performance with Gaussian Mixture Models (GMMs), another family of generative models.
Bhusan Chettri (1,2), Tomi Kinnunen (2), Emmanouil Benetos (1)
(1) School of EECS, Queen Mary University of London, United Kingdom
(2) School of Computing, University of Eastern Finland, Joensuu, Finland
November 20, 2019
Spoofing attacks attempt to gain illegitimate access to the biometric system of a registered user.

Types of attacks:
1. Text-to-speech
2. Voice conversion
3. Impersonation/mimicry
4. Replay

Countermeasures guard ASV systems from spoofing attacks. They consist of:
- Frontend: extracts discriminative features
- Backend: classification and decision making
A VAE is an unsupervised deep generative model.

Why VAEs for spoofing detection?
- Widely used in other domains (e.g. computer vision)
- An unsupervised deep generative model
- Ability to generate data
- The latent space can be analysed and manipulated: interpretability!

Main objectives:
- Feasibility study of using VAEs as a backend classifier in a two-class setting, as in GMMs (this paper)
- One-class VAE: model the true/bonafide data distribution
Training minimises a regularized log-likelihood function. Let X = {x_n}, n = 1, ..., N, denote the training set, with x_n ∈ R^D. The training loss over the entire training set X,

    L(θ, φ) = Σ_{n=1}^{N} ℓ_n(θ, φ),    (1)

decomposes into a sum of data-point-specific losses. The loss of the n-th training example is a regularized reconstruction loss:

    ℓ_n(θ, φ) = −E_{z∼q_φ(z|x_n)}[ log p_θ(x_n|z) ]  +  KL( q_φ(z|x_n) ‖ p(z) ),    (2)
                \------ reconstruction error ------/     \------ regularizer -----/

where φ and θ represent the encoder and decoder network parameters, respectively. Testing: we use the same loss function, Eq. (2), during scoring.
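The per-example loss in Eq. (2) can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it assumes a diagonal Gaussian encoder q(z|x) = N(mu, diag(exp(log_var))), a standard normal prior p(z) = N(0, I), and a unit-variance Gaussian decoder, so the reconstruction term reduces to squared error (up to a constant) and the KL term has a closed form.

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var):
    """Per-example VAE loss, Eq. (2): reconstruction error + KL regularizer.
    Sketch under the Gaussian assumptions stated above; function and
    argument names are illustrative, not from the paper."""
    # -log p(x|z) for a unit-variance Gaussian decoder, up to a constant.
    recon = 0.5 * np.sum((x - x_recon) ** 2)
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ) for diagonal Gaussians.
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return recon + kl

def total_loss(batch):
    """Training loss, Eq. (1): sum of per-example losses over the set."""
    return sum(vae_loss(*example) for example in batch)
```

Note that when mu = 0 and log_var = 0 the KL term vanishes, so a perfect reconstruction gives zero loss, matching the role of the KL term as a pure regularizer toward the prior.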
1. Datasets: ASVspoof 2017 and ASVspoof 2019 PA
2. Input representation: 100 × D
   - Constant-Q cepstral coefficients (CQCC)
   - Log power spectrogram
3. Architecture: deep CNN with convolutional and deconvolutional layers for the encoder and decoder networks; no pooling layers, but stride > 1
4. Scoring/testing: log-likelihood difference between the bonafide and spoof models; the higher the score, the higher the probability of being bonafide
5. Evaluation metrics: equal error rate (EER) and tandem detection cost function (t-DCF)
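The scoring rule and the EER metric above can be sketched as follows. This is an illustrative sketch, not the evaluation code used in the paper: since the VAE loss approximates the negative log-likelihood (negative ELBO), the log-likelihood difference is the spoof-model loss minus the bonafide-model loss, and the EER sketch simply sweeps thresholds over the pooled scores (the official t-DCF/EER tooling works differently).

```python
import numpy as np

def llr_score(loss_bonafide, loss_spoof):
    """Detection score for a test utterance: log-likelihood difference
    between the bonafide and spoof VAE models.  Higher score means the
    bonafide model explains the utterance better."""
    return loss_spoof - loss_bonafide

def eer(bonafide_scores, spoof_scores):
    """Equal error rate: operating point where the false-acceptance and
    false-rejection rates cross (higher score => bonafide).  Simple
    threshold sweep; an approximation, not the official metric code."""
    best = 1.0
    for t in np.sort(np.concatenate([bonafide_scores, spoof_scores])):
        far = np.mean(spoof_scores >= t)    # spoof accepted as bonafide
        frr = np.mean(bonafide_scores < t)  # bonafide rejected
        best = min(best, max(far, frr))
    return best
```

With perfectly separated scores the sweep finds a threshold with zero errors, so the sketch returns an EER of 0.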
Bon: bonafide/genuine; Spf: spoof/replay. Each of the three subsets has non-overlapping speakers. The ASVspoof 2017 dataset has male speakers only, while ASVspoof 2019 has both male and female speakers.

            ASVspoof 2017            ASVspoof 2019 PA
Subset   #Spkr   #Bon    #Spf     #Spkr   #Bon     #Spf
Train      10    1507    1507       20    5400    48600
Dev         8     760     950       20    5400    24300
Eval       24    1298   12008       67   18090   116640
Total      42    3565   14465      107   28890   189540
1. Challenge: making the latent space z retain discriminative information
2. The vanilla VAE approach did not work well: the bonafide and spoof VAE models seem to focus on retaining information relevant for reconstruction
3. C-VAE models show encouraging results
4. Use of an auxiliary classifier with the C-VAE did not help much: its parameters were not well optimised, leaving room for exploration and improvement
5. Future work: frame-level C-VAE
6. Future work: one-class or semi-supervised C-VAE approaches
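One common way to realise the conditioning in a C-VAE is to append a class label to the input of the encoder and decoder, so the latent space is freed from having to encode the class itself. The sketch below shows this with a one-hot bonafide/spoof label; the function name and the one-hot scheme are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def condition(x, label, num_classes=2):
    """C-VAE conditioning sketch (hypothetical helper): concatenate a
    one-hot class label (e.g. 0 = spoof, 1 = bonafide) onto the feature
    vector before feeding it to the encoder or decoder."""
    onehot = np.zeros(num_classes)
    onehot[label] = 1.0
    return np.concatenate([x, onehot])
```

At test time the same utterance can be conditioned on each class in turn, and the resulting losses compared, which fits the log-likelihood-difference scoring used in the paper.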