Adaptive DCTNet for audio signal classification

In this paper, we investigate DCTNet for audio signal classification. Its output features are related to Cohen's class of time-frequency distributions. We introduce the adaptive DCTNet (A-DCTNet) for audio feature extraction. The A-DCTNet applies the idea of the constant-Q transform, with the center frequencies of its filterbank geometrically spaced. The A-DCTNet adapts to different acoustic scales, and it captures the low-frequency acoustic information to which human auditory perception is sensitive better than features such as Mel-frequency spectral coefficients (MFSC). We use the features extracted by the A-DCTNet as input to classifiers. Experimental results show that the A-DCTNet combined with a Recurrent Neural Network (RNN) achieves a state-of-the-art classification rate on bird song data and improves artist identification accuracy on music data. These results demonstrate the A-DCTNet's applicability to signal processing problems.
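As a minimal illustration of the geometric spacing that the constant-Q idea implies (the function name, parameter choices, and example frequencies below are ours for exposition, not taken from the paper), the following Python sketch computes filterbank center frequencies in which adjacent bins keep a constant frequency ratio, so the ratio of center frequency to bandwidth stays fixed:

    import numpy as np

    def constant_q_center_frequencies(f_min, f_max, bins_per_octave):
        """Center frequencies f_k = f_min * 2**(k / bins_per_octave), up to f_max."""
        n_octaves = np.log2(f_max / f_min)               # octaves spanned by the range
        n_bins = int(np.floor(n_octaves * bins_per_octave)) + 1
        k = np.arange(n_bins)
        return f_min * 2.0 ** (k / bins_per_octave)

    # Example: 12 bins per octave from C1 (~32.7 Hz) up to 8 kHz.
    freqs = constant_q_center_frequencies(32.7, 8000.0, 12)
    print(freqs[:4])  # ~[32.70, 34.65, 36.71, 38.89]; adjacent bins differ by 2**(1/12)

With 12 bins per octave, adjacent center frequencies differ by a factor of 2**(1/12), the semitone spacing of Western music; Mel-scale filterbanks, by contrast, are spaced approximately linearly below about 1 kHz, which is one reason a geometrically spaced filterbank allocates relatively more bins to the low-frequency region.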
