1. Gaussian Mixture Models (GMM)

For a $D$-dimensional feature vector $x$, the mixture density used for the likelihood function is defined as

$p(x|\lambda) = \sum_{i=1}^{M}w_{i}p_{i}(x)$.

The density is a weighted linear combination of $M$ unimodal Gaussian densities $p_{i}(x)$, with mixture weights $w_{i}$ satisfying $\sum_{i=1}^{M}w_{i} = 1$:
$p_{i}(x) = \frac{1}{(2\pi)^{D/2}|\Sigma_{i}|^{1/2}} \exp\left[-\frac{1}{2}(x-\mu_{i})^{T}\Sigma_{i}^{-1}(x-\mu_{i})\right]$.
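
As a quick numerical illustration, the following NumPy sketch evaluates the mixture density above on a toy 2-component GMM; the weights, means, and covariances are made-up illustrative values, not parameters of any trained model.

```python
import numpy as np

def gaussian_density(x, mu, cov):
    """Unimodal Gaussian density p_i(x) for a D-dimensional vector x."""
    D = x.shape[0]
    diff = x - mu
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

def gmm_likelihood(x, weights, means, covs):
    """p(x | lambda) = sum_i w_i * p_i(x)."""
    return sum(w * gaussian_density(x, mu, cov)
               for w, mu, cov in zip(weights, means, covs))

# Toy 2-component GMM in D = 3 dimensions (illustrative values only).
weights = np.array([0.4, 0.6])
means = [np.zeros(3), np.ones(3)]
covs = [np.eye(3), 2.0 * np.eye(3)]
print(gmm_likelihood(np.array([0.5, 0.2, -0.1]), weights, means, covs))
```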

2. Support Vector Machines (SVM)

An SVM is a two-class classifier constructed from sums of a kernel function $K(\cdot,\cdot)$:

$f(x) = \sum_{i=1}^{N}\alpha_{i} t_{i} K(x,x_{i}) + d$.

where $t_{i} \in \{-1,+1\}$ are the ideal outputs, $\sum_{i=1}^{N}\alpha_{i} t_{i}=0$ and $\alpha_{i} > 0$. The $x_{i}$ are support vectors, obtained from the training set by an optimization process.
$K(\cdot,\cdot)$ is constrained to have certain properties (the Mercer condition), so that it can be expressed as
$K(x,y) = b(x)^{T} b(y)$.
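
The decision function $f(x)$ can be computed directly from this expansion. Below is a minimal sketch with placeholder support vectors, multipliers $\alpha_{i}$, labels $t_{i}$, and bias $d$ (all illustrative, not from a trained model), assuming the simple linear kernel $K(x,y) = x^{T}y$, i.e. $b(x) = x$.

```python
import numpy as np

def linear_kernel(x, y):
    # K(x, y) = b(x)^T b(y) with the identity expansion b(x) = x.
    return x @ y

def svm_score(x, support_vectors, alphas, targets, d, kernel=linear_kernel):
    """f(x) = sum_i alpha_i * t_i * K(x, x_i) + d."""
    return sum(a * t * kernel(x, sv)
               for a, t, sv in zip(alphas, targets, support_vectors)) + d

# Placeholder support vectors and multipliers (illustrative only).
support_vectors = [np.array([1.0, 0.0]), np.array([-1.0, 0.5])]
alphas = [0.7, 0.7]   # alpha_i > 0
targets = [+1, -1]    # ideal outputs t_i, chosen so that sum_i alpha_i * t_i = 0
d = 0.1
print(svm_score(np.array([0.3, -0.2]), support_vectors, alphas, targets, d))
```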

Kernel function examples: http://www.shamoxia.com/html/y2010/2292.html

3. GMM-SVM

GMM Supervectors: given a speaker utterance, GMM-UBM training is performed by MAP adaptation of the means $m_{i}$, and the adapted means are stacked to form a GMM supervector $m = [m_{i}]$.
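
A minimal sketch of this step is given below, assuming diagonal-covariance mixtures and the relevance-MAP mean update of [1]; the relevance factor `r = 16` and the toy data are illustrative choices, not the settings of any particular system.

```python
import numpy as np

def map_adapt_means(frames, ubm_weights, ubm_means, ubm_covs, r=16.0):
    """Relevance-MAP adaptation of the UBM means (diagonal covariances)."""
    M, D = ubm_means.shape
    log_like = np.empty((len(frames), M))
    for i in range(M):
        diff = frames - ubm_means[i]
        log_like[:, i] = (np.log(ubm_weights[i])
                          - 0.5 * np.sum(np.log(2 * np.pi * ubm_covs[i]))
                          - 0.5 * np.sum(diff ** 2 / ubm_covs[i], axis=1))
    # Posterior responsibilities Pr(i | x_t).
    post = np.exp(log_like - log_like.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)

    n = post.sum(axis=0)                  # soft counts n_i
    Ex = (post.T @ frames) / n[:, None]   # first-order statistics E_i[x]
    alpha = n / (n + r)                   # adaptation coefficients
    return alpha[:, None] * Ex + (1 - alpha[:, None]) * ubm_means

def supervector(adapted_means):
    """Stack the adapted means m_i into a single supervector m = [m_i]."""
    return adapted_means.reshape(-1)

# Toy usage: 100 random frames against a 4-component, 3-dimensional UBM.
rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 3))
w, mu, cov = np.full(4, 0.25), rng.normal(size=(4, 3)), np.ones((4, 3))
print(supervector(map_adapt_means(frames, w, mu, cov)).shape)  # (12,)
```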

GMM Supervector Linear Kernel: a natural distance between two utterances is the KL divergence between their adapted GMMs, but it does not satisfy the Mercer condition, so we consider the following upper-bound approximation:

$\begin{align} D(g_{a}\parallel g_{b}) &= \int_{\mathbb{R}^{n}} g_{a}(x)\log\left(\frac{g_{a}(x)}{g_{b}(x)}\right)dx \\ &\leq \sum_{i=1}^{N} \lambda_{i}\, D\big(N(\cdot;m_{i}^{a},\Sigma_{i})\parallel N(\cdot;m_{i}^{b},\Sigma_{i})\big) \\ &= \frac{1}{2}\sum_{i=1}^{N}\lambda_{i}(m_{i}^{a}-m_{i}^{b})^{T}\Sigma_{i}^{-1}(m_{i}^{a}-m_{i}^{b}) \end{align}$

From this bound we can define a new distance,
$d(m^{a},m^{b}) = \frac{1}{2} \sum_{i=1}^{N}\lambda_{i}(m_{i}^{a}-m_{i}^{b})^{T}\Sigma_{i}^{-1}(m_{i}^{a}-m_{i}^{b})$.

From this distance we can read off the corresponding inner product, which is the linear supervector kernel
$K(m^{a},m^{b}) = \sum_{i=1}^{N}\left(\sqrt{\lambda_{i}}\,\Sigma_{i}^{-1/2}m_{i}^{a}\right)^{T}\left(\sqrt{\lambda_{i}}\,\Sigma_{i}^{-1/2}m_{i}^{b}\right)$.
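
In the diagonal-covariance case this kernel is just a dot product once each mean block is scaled by $\sqrt{\lambda_{i}}\,\Sigma_{i}^{-1/2}$. A minimal sketch with made-up toy parameters:

```python
import numpy as np

def normalize_supervector(means, ubm_weights, ubm_covs):
    """Scale each mean block by sqrt(lambda_i) * Sigma_i^{-1/2} (diagonal case)."""
    scaled = np.sqrt(ubm_weights)[:, None] * means / np.sqrt(ubm_covs)
    return scaled.reshape(-1)

def supervector_linear_kernel(means_a, means_b, ubm_weights, ubm_covs):
    """K(m^a, m^b): the inner product matching the distance d(m^a, m^b)."""
    va = normalize_supervector(means_a, ubm_weights, ubm_covs)
    vb = normalize_supervector(means_b, ubm_weights, ubm_covs)
    return va @ vb

# Toy example: 4 mixtures, 3 dimensions, made-up adapted means.
rng = np.random.default_rng(1)
w, cov = np.full(4, 0.25), np.ones((4, 3))
ma, mb = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
print(supervector_linear_kernel(ma, mb, w, cov))
```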

4. SVM-NAP

Nuisance Attribute Projection (NAP) attempts to remove the unwanted within-class variation of the observed feature vectors. This is achieved by applying the transform

$y' = P_{n}y = (I-V_{n}V_{n}^{T})y$.

where $V_{n}$ contains the principal directions of within-class (nuisance) variability.
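
One common way to obtain $V_{n}$ is from the top eigenvectors of the within-speaker scatter of the expanded vectors. The sketch below follows that route; the corank and the random toy data are illustrative only.

```python
import numpy as np

def estimate_nuisance_directions(vectors, labels, corank):
    """Top right-singular vectors of the within-class-centered data: one choice for V_n."""
    centered = np.vstack([vectors[labels == c] - vectors[labels == c].mean(axis=0)
                          for c in np.unique(labels)])
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:corank].T                  # shape (dim, corank)

def nap_project(y, Vn):
    """y' = (I - Vn Vn^T) y, removing the nuisance subspace."""
    return y - Vn @ (Vn.T @ y)

# Toy usage: 20 expanded vectors from 4 "speakers", remove a corank-2 subspace.
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 12))
labels = np.repeat(np.arange(4), 5)
Vn = estimate_nuisance_directions(X, labels, corank=2)
print(nap_project(X[0], Vn).shape)  # (12,)
```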

The SVM NAP constructs a new kernel,

$\begin{align} K(m^{a},m^{b}) &= [Pb(m^{a})]^{T}[Pb(m^{b})] \\ &= b(m^{a})^{T}Pb(m^{b}) \\ &= b(m^{a})^{T}(I-vv^{T})b(m^{b}) \end{align}$.

where $P$ is a projection ($P^{2} = P$), $v$ is the direction being removed from the SVM expansion space, $b(\cdot)$ is the SVM expansion, and $\parallel v\parallel^{2} = 1$.
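
A self-contained sketch of this compensated kernel, treating $b(m^{a})$ and $b(m^{b})$ as generic expansion vectors and using a random orthonormal nuisance basis purely for illustration:

```python
import numpy as np

def nap_kernel(b_a, b_b, Vn):
    """K = [P b(m^a)]^T [P b(m^b)] with P = I - Vn Vn^T (so P^2 = P)."""
    pa = b_a - Vn @ (Vn.T @ b_a)
    pb = b_b - Vn @ (Vn.T @ b_b)
    return pa @ pb

# Toy check: the compensated kernel is symmetric in its arguments.
rng = np.random.default_rng(3)
Vn, _ = np.linalg.qr(rng.normal(size=(12, 2)))   # orthonormal nuisance directions
b_a, b_b = rng.normal(size=12), rng.normal(size=12)
print(np.isclose(nap_kernel(b_a, b_b, Vn), nap_kernel(b_b, b_a, Vn)))
```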

5. Experiments on NIST SRE 2010

In the experiments, I use spro4 to extract MFCC and LPCC features, and Alize to build the GMM-UBM and to carry out all of the remaining tasks.

References

[1] Douglas A. Reynolds, T. F. Quatieri, and R. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, no. 1-3, pp. 19–41, 2000.
[2] W. Campbell, "Generalized linear discriminant sequence kernels for speaker recognition," in IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, 2002, pp. 161–164.
[3] W. M. Campbell, D. E. Sturim, D. A. Reynolds, and A. Solomonoff, "SVM based speaker verification using a GMM supervector kernel and NAP variability compensation."
[4] Robbie Vogt, Sachin Kajarekar, and Sridha Sridharan, "Discriminant NAP for SVM speaker recognition."

Original Link: http://ibillxia.github.io/blog/2012/08/13/GMM-SVM-NAP-Method-in-Speaker-Recognition/
Attribution - NON-Commercial - ShareAlike - Copyright © Bill Xia