Generative models: Variational Autoencoders vs. Gaussian Mixture Models for voice anti-spoofing
In this study, Bhusan Chettri explores the feasibility of generative models such as the Variational Autoencoder (VAE) for voice anti-spoofing, and compares their performance with Gaussian Mixture Models (GMMs), another family of generative models.
Bhusan Chettri (1,2), Tomi Kinnunen (2), Emmanouil Benetos (1)
(1) School of EECS, Queen Mary University of London, United Kingdom
(2) School of Computing, University of Eastern Finland, Joensuu, Finland
November 20, 2019
Spoofing attacks attempt to gain illegitimate access to the biometric system of a registered user.

Types of attacks:
1. Text-to-speech
2. Voice conversion
3. Impersonation/mimicry
4. Replay

Countermeasures guard ASV systems from spoofing attacks. They consist of:
- Frontend: extracts discriminative features
- Backend: classification and decision making
A VAE is an unsupervised deep generative model.

Why VAEs for spoofing detection?
- Widely used in other domains (e.g. computer vision)
- An unsupervised deep generative model
- Ability to generate data
- The latent space can be analysed and manipulated: interpretability!

Main objectives:
- Feasibility study of using VAEs as a backend classifier in a two-class setting, as in GMMs (this paper)
- One-class VAE: model the true/bonafide data distribution
Training minimises a regularized log-likelihood function. Let X = {x_n}, n = 1, ..., N, denote the training set, with x_n ∈ R^D. The training loss over the entire training set X,

    L(θ, φ) = Σ_{n=1}^{N} ℓ_n(θ, φ),    (1)

decomposes into a sum of data-point-specific losses. The loss of the n-th training example is a regularized reconstruction loss:

    ℓ_n(θ, φ) = −E_{z∼q_φ(z|x_n)}[ log p_θ(x_n|z) ]  +  KL( q_φ(z|x_n) ‖ p(z) ),    (2)
                \------ reconstruction error ------/     \------ regularizer -----/

where φ and θ represent the encoder and decoder network parameters, respectively. Testing: we use the same loss function, Eq. (2), during scoring.
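The per-example loss in Eq. (2) can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it assumes a diagonal Gaussian encoder q(z|x) = N(mu, diag(exp(log_var))), a standard normal prior p(z) = N(0, I), and a unit-variance Gaussian decoder, so the reconstruction term reduces to squared error (up to a constant) and the KL term has a closed form.

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var):
    """Per-example VAE loss, Eq. (2): reconstruction error + KL regularizer.
    Sketch under the Gaussian assumptions stated above; function and
    argument names are illustrative, not from the paper."""
    # -log p(x|z) for a unit-variance Gaussian decoder, up to a constant.
    recon = 0.5 * np.sum((x - x_recon) ** 2)
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ) for diagonal Gaussians.
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return recon + kl

def total_loss(batch):
    """Training loss, Eq. (1): sum of per-example losses over the set."""
    return sum(vae_loss(*example) for example in batch)
```

Note that when mu = 0 and log_var = 0 the KL term vanishes, so a perfect reconstruction gives zero loss, matching the role of the KL term as a pure regularizer toward the prior.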
1. Datasets: ASVspoof 2017 and ASVspoof 2019 PA
2. Input representation: 100 × D
   - Constant-Q cepstral coefficients (CQCC)
   - Log power spectrogram
3. Architecture: deep CNN with convolutional and deconvolutional layers for the encoder and decoder networks; no pooling layers, but stride > 1
4. Scoring/testing: log-likelihood difference between the bonafide and spoof models; the higher the score, the higher the probability of being bonafide
5. Evaluation metrics: equal error rate (EER) and tandem detection cost function (t-DCF)
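The scoring rule and the EER metric above can be sketched as follows. This is an illustrative sketch, not the evaluation code used in the paper: since the VAE loss approximates the negative log-likelihood (negative ELBO), the log-likelihood difference is the spoof-model loss minus the bonafide-model loss, and the EER sketch simply sweeps thresholds over the pooled scores (the official t-DCF/EER tooling works differently).

```python
import numpy as np

def llr_score(loss_bonafide, loss_spoof):
    """Detection score for a test utterance: log-likelihood difference
    between the bonafide and spoof VAE models.  Higher score means the
    bonafide model explains the utterance better."""
    return loss_spoof - loss_bonafide

def eer(bonafide_scores, spoof_scores):
    """Equal error rate: operating point where the false-acceptance and
    false-rejection rates cross (higher score => bonafide).  Simple
    threshold sweep; an approximation, not the official metric code."""
    best = 1.0
    for t in np.sort(np.concatenate([bonafide_scores, spoof_scores])):
        far = np.mean(spoof_scores >= t)    # spoof accepted as bonafide
        frr = np.mean(bonafide_scores < t)  # bonafide rejected
        best = min(best, max(far, frr))
    return best
```

With perfectly separated scores the sweep finds a threshold with zero errors, so the sketch returns an EER of 0.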
Bon: bonafide/genuine; Spf: spoof/replay. Each of the three subsets has non-overlapping speakers. The ASVspoof 2017 dataset has male speakers only, while ASVspoof 2019 has both male and female speakers.

            ASVspoof 2017            ASVspoof 2019 PA
Subset   #Spkr   #Bon    #Spf     #Spkr   #Bon     #Spf
Train      10    1507    1507       20    5400    48600
Dev         8     760     950       20    5400    24300
Eval       24    1298   12008       67   18090   116640
Total      42    3565   14465      107   28890   189540
1. Challenge: making the latent space z retain discriminative information
2. The vanilla VAE approach did not work well: the bonafide and spoof VAE models seem to focus on retaining information relevant for reconstruction
3. C-VAE models show encouraging results
4. Use of an auxiliary classifier with the C-VAE did not help much: its parameters were not well optimised, leaving room for exploration and improvement
5. Future work: frame-level C-VAE
6. Future work: one-class or semi-supervised C-VAE approaches
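One common way to realise the conditioning in a C-VAE is to append a class label to the input of the encoder and decoder, so the latent space is freed from having to encode the class itself. The sketch below shows this with a one-hot bonafide/spoof label; the function name and the one-hot scheme are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def condition(x, label, num_classes=2):
    """C-VAE conditioning sketch (hypothetical helper): concatenate a
    one-hot class label (e.g. 0 = spoof, 1 = bonafide) onto the feature
    vector before feeding it to the encoder or decoder."""
    onehot = np.zeros(num_classes)
    onehot[label] = 1.0
    return np.concatenate([x, onehot])
```

At test time the same utterance can be conditioned on each class in turn, and the resulting losses compared, which fits the log-likelihood-difference scoring used in the paper.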