Deep Within-Class Covariance Analysis for Acoustic Scene Classification

Within-Class Covariance Normalization (WCCN) is a powerful post-processing method for normalizing the within-class covariance of a set of data points. WCCN projects the observations into a linear sub-space in which the within-class variability is reduced, a property that has proven beneficial in subsequent recognition tasks. The central idea of this paper is to reformulate classic WCCN in a form compatible with Deep Neural Networks (DNNs). We propose Deep Within-Class Covariance Analysis (DWCCA), which can be incorporated into a DNN architecture. This formulation lets us exploit the beneficial properties of WCCN while still allowing end-to-end training with Stochastic Gradient Descent (SGD). We investigate the advantages of DWCCA on deep neural networks with convolutional layers for supervised learning. Our results on Acoustic Scene Classification show that with DWCCA we can achieve equal or superior performance in a VGG-style deep neural network.
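
To make the classic WCCN step concrete, the following is a minimal NumPy sketch, not the implementation used in this paper. It assumes the common formulation in which the within-class covariance W is the average of the per-class covariances and the projection B is the Cholesky factor of the inverse of W, so that B Bᵀ = W⁻¹. The function names `within_class_cov` and `wccn_projection`, the `eps` regularizer, and the synthetic data are assumptions made for illustration only.

```python
import numpy as np


def within_class_cov(X, y):
    """Average of the per-class covariance matrices (rows of X are samples)."""
    classes = np.unique(y)
    d = X.shape[1]
    W = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        Xc = Xc - Xc.mean(axis=0)        # center on the class mean
        W += Xc.T @ Xc / Xc.shape[0]     # covariance of class c
    return W / len(classes)


def wccn_projection(X, y, eps=1e-6):
    """Return B with B @ B.T == inv(W); eps is an assumed stabilizer."""
    W = within_class_cov(X, y)
    W += eps * np.eye(W.shape[0])        # keep W safely invertible
    return np.linalg.cholesky(np.linalg.inv(W))


# Synthetic data: 3 classes, 50 samples each, correlated features.
rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5, 0.0, 0.0],
              [0.0, 1.0, 0.3, 0.0],
              [0.0, 0.0, 1.5, 0.2],
              [0.0, 0.0, 0.0, 1.0]])
means = np.array([[0.0, 0.0, 0.0, 0.0],
                  [3.0, 1.0, 0.0, 2.0],
                  [-2.0, 4.0, 1.0, 0.0]])
y = np.repeat(np.arange(3), 50)
X = rng.normal(size=(150, 4)) @ A + means[y]

B = wccn_projection(X, y)
X_proj = X @ B                           # row-wise form of B.T @ x
W_after = within_class_cov(X_proj, y)    # approximately the identity
```

After the projection, the within-class covariance of `X_proj` is close to the identity matrix, which is the normalization property that WCCN is used for. Making such a step trainable inside a network, as DWCCA does, additionally requires differentiating through it, which this sketch does not attempt.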
