Interpretable Convolutional Filters with SincNet

Deep learning is currently playing a crucial role in the progress toward higher levels of artificial intelligence. This paradigm allows neural networks to learn complex and abstract representations that are progressively obtained by combining simpler ones. Nevertheless, the internal "black-box" representations automatically discovered by current neural architectures often lack interpretability, which makes the study of explainable machine learning techniques a primary concern. This paper summarizes our recent efforts to develop a more interpretable neural model for directly processing speech from the raw waveform. In particular, we propose SincNet, a novel Convolutional Neural Network (CNN) that encourages the first layer to discover more meaningful filters by exploiting parametrized sinc functions. In contrast to standard CNNs, which learn all the elements of each filter, SincNet learns only the low and high cutoff frequencies of band-pass filters directly from data. This inductive bias offers a very compact way to derive a customized filter-bank front-end that depends only on a few parameters with a clear physical meaning. Our experiments, conducted on both speaker and speech recognition, show that the proposed architecture converges faster, performs better, and is more interpretable than standard CNNs.
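
The idea of a filter defined only by its two cutoff frequencies can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the band-pass kernel is built as the difference of two ideal low-pass (sinc) filters and smoothed with a Hamming window, and the names `sinc_bandpass` and `gain_at`, the kernel length of 251 taps, and the 16 kHz sample rate are hypothetical choices made for illustration. In a trainable model the two cutoffs would be the learned parameters; here they are fixed numbers.

```python
import numpy as np


def sinc_bandpass(f_low, f_high, kernel_size=251, sample_rate=16000):
    """Band-pass FIR kernel determined only by two cutoff frequencies (Hz).

    kernel_size and sample_rate are illustrative assumptions.
    """
    # Symmetric time axis in samples, centred on zero.
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    # Cutoffs normalised to cycles per sample.
    f1 = f_low / sample_rate
    f2 = f_high / sample_rate
    # Difference of two ideal low-pass filters.
    # Note: np.sinc(x) is the normalised sinc, sin(pi * x) / (pi * x).
    kernel = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    # Assumed Hamming window to reduce ripple caused by truncation.
    return kernel * np.hamming(kernel_size)


def gain_at(kernel, freq_hz, sample_rate=16000):
    """Magnitude of the kernel's frequency response at a single frequency."""
    n = np.arange(len(kernel))
    phasor = np.exp(-2j * np.pi * freq_hz * n / sample_rate)
    return np.abs(np.sum(kernel * phasor))


# One filter: 251 kernel values, but only 2 underlying parameters.
kernel = sinc_bandpass(1000.0, 2000.0)
print(kernel.shape, gain_at(kernel, 1500.0), gain_at(kernel, 4000.0))
```

Under these assumptions, every filter in the first layer is described by two numbers that can be read directly as a frequency band, whereas a standard convolutional filter of the same length would have one free parameter per tap.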
