LEAF: A Learnable Frontend for Audio Classification

Mel-filterbanks are fixed, engineered audio features that emulate human perception and have been used throughout the history of audio understanding, up to the present day. However, their undeniable qualities are counterbalanced by the fundamental limitations of handmade representations. In this work we show that we can train a single learnable frontend that outperforms mel-filterbanks on a wide range of audio signals, including speech, music, audio events and animal sounds, providing a general-purpose learned frontend for audio classification. To do so, we introduce a new principled, lightweight, fully learnable architecture that can be used as a drop-in replacement for mel-filterbanks. Our system learns all operations of audio feature extraction, from filtering to pooling, compression and normalization, and can be integrated into any neural network at a negligible parameter cost. We perform multi-task training on eight diverse audio classification tasks, and show consistent improvements of our model over mel-filterbanks and previous learnable alternatives. Moreover, our system outperforms the current state-of-the-art learnable frontend on AudioSet, with orders of magnitude fewer parameters.
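Since the abstract describes the frontend as a chain of filtering, pooling, compression and normalization stages, a minimal NumPy sketch of that chain may help fix ideas. This is an illustration under stated assumptions, not the paper's implementation: the function names (`gabor_filterbank`, `frontend_features`) and all constants (filter count, window and hop sizes, the PCEN exponents) are hypothetical, and every parameter that the real frontend would learn by gradient descent is hard-coded here.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def gabor_filterbank(n_filters=40, length=401, sr=16000):
    """Complex Gabor band-pass filters with mel-spaced center frequencies.
    In a learnable frontend the centers and bandwidths would be trained;
    here they are fixed, illustrative initial values."""
    t = np.arange(-(length // 2), length // 2 + 1)
    mels = np.linspace(hz_to_mel(60.0), hz_to_mel(sr / 2 - 200.0), n_filters)
    centers = mel_to_hz(mels) * 2.0 * np.pi / sr        # rad / sample
    sigmas = np.full(n_filters, 40.0)                   # envelope widths (placeholder)
    envelope = np.exp(-(t ** 2) / (2.0 * sigmas[:, None] ** 2))
    carriers = np.exp(1j * np.outer(centers, t))
    return carriers * envelope / np.sqrt(2.0 * np.pi * sigmas[:, None] ** 2)

def frontend_features(wav, sr=16000, hop=160, window=401,
                      alpha=0.96, delta=2.0, r=0.5, s=0.04, eps=1e-6):
    """Filtering -> squared modulus -> lowpass pooling -> PCEN-style compression."""
    fb = gabor_filterbank(sr=sr, length=window)
    # 1) band-pass filtering with complex Gabor kernels, squared-modulus envelope
    env = np.stack([np.abs(np.convolve(wav, f, mode="same")) ** 2 for f in fb])
    # 2) Gaussian lowpass pooling, then decimation by the hop size
    t = np.arange(-(window // 2), window // 2 + 1)
    g = np.exp(-0.5 * (t / (0.4 * window / 2.0)) ** 2)
    g /= g.sum()
    pooled = np.stack([np.convolve(e, g, mode="same")[::hop] for e in env])
    # 3) per-channel energy normalization (PCEN): smoothed AGC + root compression
    m = np.zeros_like(pooled)
    m[:, 0] = pooled[:, 0]
    for n in range(1, pooled.shape[1]):
        m[:, n] = (1.0 - s) * m[:, n - 1] + s * pooled[:, n]
    return (pooled / (eps + m) ** alpha + delta) ** r - delta ** r
```

Calling `frontend_features(np.random.randn(16000))` on one second of 16 kHz audio returns a (40, 100) array, matching the shape of a 40-band mel-filterbank with a 10 ms hop; replacing the fixed constants with trainable parameters is what would make such a frontend learnable end to end.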
