Detection and reconstruction of clipped speech for speaker recognition

Abstract: Clipping is often observed in speech acquisition, due to the limited numerical range or the non-linear compensation of recording devices. Clipping inevitably changes the spectrum of the speech signal and thus partially distorts the speaker information it contains. This paper investigates the impact of signal clipping on speaker recognition, and proposes a simple yet effective clipping detection approach as well as a signal reconstruction approach based on deep neural networks (DNNs). The experiments are conducted on the core test of the NIST SRE2008 task by simulating clipped speech at various clipping rates. The results show that clipping does degrade speaker recognition performance, but the impact is marginal unless the clipping rate is larger than 80%. We also find that the simple distribution-based detection method detects clipped speech with high accuracy, and that the DNN-based reconstruction achieves promising performance gains for speaker recognition on clipped speech.
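As an illustration of the two ideas named in the abstract, simulating clipped speech at a chosen clipping rate and detecting clipping from the amplitude distribution, the following is a minimal sketch and not the paper's method. It assumes that "clipping rate" means the fraction of samples that saturate, which the abstract does not define, and it uses a simple heuristic (how many samples pile up at the peak amplitude) as a stand-in for a distribution-based detector. The function names, the tolerance `tol`, and the decision threshold `min_fraction` are hypothetical choices made for this example.

```python
import numpy as np


def clip_signal(x, clip_rate):
    """Hard-clip x so that roughly `clip_rate` of its samples saturate.

    Assumption: clipping rate = fraction of clipped samples.
    Returns the clipped signal and the threshold that was used.
    """
    if not 0.0 <= clip_rate < 1.0:
        raise ValueError("clip_rate must be in [0, 1)")
    # Choose the threshold as the (1 - clip_rate) quantile of |x|, so that
    # about clip_rate of the samples lie at or above it.
    threshold = np.quantile(np.abs(x), 1.0 - clip_rate)
    return np.clip(x, -threshold, threshold), threshold


def clipped_fraction(x, tol=1e-3):
    """Fraction of samples lying within tol * peak of the peak amplitude."""
    peak = np.max(np.abs(x))
    if peak == 0:
        return 0.0
    return float(np.mean(np.abs(x) >= (1.0 - tol) * peak))


def is_clipped(x, min_fraction=0.005):
    """Flag a signal as clipped when many samples sit at the peak value.

    Unclipped speech has a peaked, heavy-tailed amplitude distribution, so
    very few samples sit at the extreme; clipping piles mass up there.
    """
    return clipped_fraction(x) >= min_fraction


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Laplace noise as a rough stand-in for speech amplitudes (assumption).
    clean = rng.laplace(size=16000)
    clipped, threshold = clip_signal(clean, clip_rate=0.2)
    print("clean flagged:", is_clipped(clean))
    print("clipped flagged:", is_clipped(clipped))
```

A detector of this kind only handles hard clipping at a constant level; clipping followed by filtering or coding spreads the saturated samples out and would need a different treatment.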
