Discriminative frequency filter banks learning with neural networks

Filter banks applied to spectra play an important role in many audio applications. Traditionally, the filters are distributed linearly on a perceptual frequency scale such as the Mel scale, and they are placed so that adjacent filters overlap, which smooths the output. However, such fixed-parameter filters are usually derived from psychoacoustic experiments and chosen empirically rather than optimized for a given task. To make filter banks discriminative, the authors use a neural network to adaptively learn the center frequency, bandwidth, gain, and shape of each filter when the filter bank serves as a feature extractor. The paper investigates several constraints on discriminative frequency filter banks as well as the dual problem of spectrum reconstruction. Experiments on audio source separation and audio scene classification show that the proposed filter banks outperform traditional fixed-parameter triangular or Gaussian filters on the Mel scale, reducing classification error on the LITIS ROUEN and DCASE2016 datasets by 13.9% and 4.6% relative, respectively.
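As a rough sketch of the idea (not the authors' exact parameterization), a Gaussian filter bank can be built from per-filter center, bandwidth, and gain parameters; under the proposed approach these would be initialized on a perceptual scale and then trained jointly with the network by gradient descent instead of being fixed. All names and shapes below are illustrative assumptions:

```python
import numpy as np

def gaussian_filterbank(centers, bandwidths, gains, n_bins):
    """Build a bank of Gaussian filters over FFT bins.

    Each filter has a center frequency, bandwidth, and gain --
    the parameters the paper proposes to learn discriminatively
    rather than fix from psychoacoustic data.
    """
    bins = np.arange(n_bins)[None, :]     # shape (1, n_bins)
    c = centers[:, None]                  # shape (n_filters, 1)
    b = bandwidths[:, None]
    g = gains[:, None]
    # Broadcasting yields an (n_filters, n_bins) filter matrix.
    return g * np.exp(-0.5 * ((bins - c) / b) ** 2)

# Mel-style initialization: centers spaced linearly over the
# spectrum (a real setup would space them on the Mel scale).
n_filters, n_bins = 40, 257
centers = np.linspace(5.0, n_bins - 5.0, n_filters)
bandwidths = np.full(n_filters, 4.0)
gains = np.ones(n_filters)

W = gaussian_filterbank(centers, bandwidths, gains, n_bins)
spectrum = np.abs(np.random.randn(n_bins))   # stand-in magnitude spectrum
features = W @ spectrum                      # (n_filters,) filter-bank features
```

In training, `centers`, `bandwidths`, and `gains` would be free parameters of the feature-extraction layer, updated by backpropagation through the classification or separation loss; the constraints studied in the paper (e.g., keeping filters ordered and bandwidths positive) would be imposed on these parameters.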
