Speech and Speaker Recognition from Raw Waveform with SincNet

Deep neural networks can learn complex and abstract representations, that are progressively obtained by combining simpler ones. A recent trend in speech and speaker recognition consists in discovering these representations starting from raw audio samples directly. Differently from standard hand-crafted features such as MFCCs or FBANK, the raw waveform can potentially help neural networks discover better and more customized representations. The high-dimensional raw inputs, however, can make training significantly more challenging. This paper summarizes our recent efforts to develop a neural architecture that efficiently processes speech from audio waveforms. In particular, we propose SincNet, a novel Convolutional Neural Network (CNN) that encourages the first layer to discover meaningful filters by exploiting parametrized sinc functions. In contrast to standard CNNs, which learn all the elements of each filter, only low and high cutoff frequencies of band-pass filters are directly learned from data. This inductive bias offers a very compact way to derive a customized front-end, that only depends on some parameters with a clear physical meaning. Our experiments, conducted on both speaker and speech recognition, show that the proposed architecture converges faster, performs better, and is more computationally efficient than standard CNNs.

[1]  Yoshua Bengio,et al.  Twin Regularization for online speech recognition , 2018, INTERSPEECH.

[2]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[3]  Iasonas Kokkinos,et al.  Learning Filterbanks from Raw Speech for Phone Recognition , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[4]  Yoshua Bengio,et al.  Batch-normalized joint training for DNN-based distant speech recognition , 2016, 2016 IEEE Spoken Language Technology Workshop (SLT).

[5]  John H. L. Hansen,et al.  Text-Independent Speaker Verification Based on Triplet Convolutional Neural Network Embeddings , 2018, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[6]  Erik McDermott,et al.  Deep neural networks for small footprint text-dependent speaker verification , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7]  Hye-jin Shim,et al.  A Complete End-to-End Speaker Verification System Using Deep Neural Networks: From Raw Signals to Verification Result , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8]  Maurizio Omologo,et al.  The DIRHA-ENGLISH corpus and related tasks for distant-speech recognition in domestic environments , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[9]  Yoshua Bengio,et al.  Speaker Recognition from Raw Waveform with SincNet , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[10]  Kai Yu,et al.  End-to-end spoofing detection with raw waveform CLDNNS , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11]  Hye-jin Shim,et al.  Avoiding Speaker Overfitting in End-to-End DNNs Using Raw Waveform for Text-Independent Speaker Verification , 2018, INTERSPEECH.

[12]  Sébastien Marcel,et al.  Towards Directly Modeling Raw Speech Signal for Speaker Verification Using CNNS , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[13]  Yoshua Bengio,et al.  Light Gated Recurrent Units for Speech Recognition , 2018, IEEE Transactions on Emerging Topics in Computational Intelligence.

[14]  Jun Guo,et al.  DNN Filter Bank Cepstral Coefficients for Spoofing Detection , 2017, IEEE Access.

[15]  Sanjeev Khudanpur,et al.  Acoustic Modelling from the Signal Domain Using CNNs , 2016, INTERSPEECH.

[16]  Sanjeev Khudanpur,et al.  Librispeech: An ASR corpus based on public domain audio books , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17]  Geoffrey E. Hinton,et al.  Layer Normalization , 2016, ArXiv.

[18]  Tara N. Sainath,et al.  Learning the speech front-end with raw waveform CLDNNs , 2015, INTERSPEECH.

[19]  Seiichi Nakagawa,et al.  A deep neural network integrated with filterbank learning for speech recognition , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[20]  Tara N. Sainath,et al.  Speaker location and microphone spacing invariant acoustic modeling from raw multichannel waveforms , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[21]  Yoshua Bengio,et al.  Improving Speech Recognition by Revising Gated Recurrent Units , 2017, INTERSPEECH.

[22]  Douglas A. Reynolds,et al.  Deep Neural Network Approaches to Speaker and Language Recognition , 2015, IEEE Signal Processing Letters.

[23]  Maurizio Omologo,et al.  Realistic Multi-Microphone Data Simulation for Distant Speech Recognition , 2016, INTERSPEECH.

[24]  Yoshua Bengio,et al.  Understanding the difficulty of training deep feedforward neural networks , 2010, AISTATS.

[25]  Dong Yu,et al.  Automatic Speech Recognition: A Deep Learning Approach , 2014 .

[26]  George Trigeorgis,et al.  Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[27]  Titouan Parcollet,et al.  The Pytorch-kaldi Speech Recognition Toolkit , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[28]  Hermann Ney,et al.  Acoustic modeling with deep neural networks using raw time signal for LVCSR , 2014, INTERSPEECH.

[29]  Patrick Kenny,et al.  Deep Speaker Embeddings for Short-Duration Speaker Verification , 2017, INTERSPEECH.

[30]  Tara N. Sainath,et al.  Learning filter banks within a deep neural network framework , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[31]  Ron J. Weiss,et al.  Speech acoustic modeling from raw multichannel waveforms , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[32]  Joon Son Chung,et al.  VoxCeleb: A Large-Scale Speaker Identification Dataset , 2017, INTERSPEECH.

[33]  Dimitri Palaz,et al.  Analysis of CNN-based speech recognition system using raw speech as input , 2015, INTERSPEECH.

[34]  Yun Lei,et al.  Advances in deep neural network approaches to speaker recognition , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[35]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[36]  Mirco Ravanelli,et al.  Deep Learning for Distant Speech Recognition , 2017, ArXiv.