Robust Speaker Verification Based on Max Pooling of Sparse Representation

In the human nervous system, sensory inputs are coded sparsely: only a small number of neurons are active at any given time. Sparse coding is therefore a plausible model of the auditory cortex. In this paper, we propose a biologically inspired feature extraction method for speaker verification based on sparse coding. When speech data are encoded with a sparse coding model, the learned dictionary has characteristics similar to the simple-cell receptive fields of auditory neurons, and the sparse coding coefficients simulate the responses of auditory cortex neurons. Moreover, a separate dictionary is learned from each speaker's training samples, so it carries more speaker-specific information and helps discriminate between speakers with fewer dictionary atoms. Motivated by the human auditory masking effect, the method uses max pooling, in which a neuron responds to the strongest of its pooled inputs and inhibits the weaker ones. Because this strategy mirrors how natural sounds are represented, it improves the robustness of the proposed method. Experimental results show that the proposed method outperforms the baseline system on two typical corpora.
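The two steps named in the abstract, sparse coding of speech frames followed by max pooling of the coefficients, can be illustrated with a minimal sketch. It is not the authors' implementation: the greedy orthogonal matching pursuit solver, the random dictionary `D`, the random `frames`, the sparsity level `k`, and the use of absolute values in the pooling step are all assumptions chosen only to keep the example self-contained. In the paper the dictionary is learned from each speaker's training data.

```python
import numpy as np

def omp(D, x, k):
    """Greedy orthogonal matching pursuit (an assumed solver, not
    necessarily the paper's): approximate x with at most k atoms of D."""
    coef = np.zeros(D.shape[1])
    residual = x.astype(float).copy()
    support = []
    for _ in range(k):
        # Select the atom most correlated with the current residual.
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j in support:
            break  # no new atom improves the fit
        support.append(j)
        # Refit all selected atoms jointly by least squares.
        sol, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        coef[:] = 0.0
        coef[support] = sol
        residual = x - D[:, support] @ sol
    return coef

def max_pool(codes):
    """Pool over frames: keep, for each atom, its strongest absolute
    response, so weaker responses to the same atom are discarded."""
    return np.max(np.abs(codes), axis=0)

# Hypothetical data: 50 unit-norm atoms of dimension 20, and 30 frames.
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)
frames = rng.standard_normal((30, 20))

codes = np.stack([omp(D, f, k=5) for f in frames])  # one sparse code per frame
feature = max_pool(codes)                           # one fixed-length vector per utterance
```

Pooling turns a variable number of frame-level codes into a single vector whose length equals the number of dictionary atoms, which is what makes utterances of different durations comparable in this sketch.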
