Analysis of CNN-based speech recognition system using raw speech as input

Abstract

Automatic speech recognition systems typically model the relationship between the acoustic speech signal and the phones in two separate steps: feature extraction and classifier training. In our recent works, we have shown that, in the framework of convolutional neural networks (CNN), the relationship between the raw speech signal and the phones can be directly modeled and ASR systems competitive with the standard approach can be built. In this paper, we first analyze and show that, between the first two convolutional layers, the CNN learns (in parts) and models the phone-specific spectral envelope information of 2-4 ms speech. Given that, we show that the CNN-based approach yields ASR trends similar to the standard short-term spectral based ASR system under mismatched (noisy) conditions, with the CNN-based approach being more robust.

Index Terms: automatic speech recognition, convolutional neural networks, raw signal, robust speech recognition.

1. Introduction

State-of-the-art automatic speech recognition (ASR) systems typically model the relationship between the acoustic speech signal and the phones in two separate steps, which are optimized in an independent manner [1]. In a first step, the speech signal is transformed into features, usually composed of a dimensionality reduction phase and an information selection phase, based on task-specific knowledge of the phenomena. These two phases have been carefully hand-crafted, leading to state-of-the-art features such as Mel frequency cepstral coefficients (MFCCs) or perceptual linear prediction cepstral features (PLPs). In a second step, the likelihood of subword units, such as phonemes, is estimated using generative or discriminative models.

In recent years, in the hybrid HMM/ANN framework [1], there has been growing interest in using "intermediate" representations instead of conventional features, such as cepstral-based features, as input for neural network-based systems. ANNs with deep learning architectures, more precisely deep neural networks (DNNs) [2, 3], which can yield better systems than a single-hidden-layer MLP, have been proposed to address various aspects of acoustic modeling. More specifically: use of context-dependent phonemes [4, 5]; use of spectral features as opposed to cepstral features [6, 7]; CNN-based systems with Mel filter bank energies as input [8, 9, 10]; and combination of different features [11], to name a few. Feature learning from the raw speech signal using neural network-based systems has also been investigated in [12]. In all these approaches, the features
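To make the architecture discussed above concrete, the following is a minimal sketch, assuming PyTorch, of a CNN that maps a window of raw speech samples directly to phone class scores, bypassing hand-crafted feature extraction. The layer sizes, kernel widths, and number of phone classes are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch (not the authors' exact architecture): a 1-D CNN that maps a
# raw speech window directly to phone class scores, as would be used in a
# hybrid HMM/ANN system. All hyperparameters below are illustrative.
import torch
import torch.nn as nn

class RawSpeechCNN(nn.Module):
    def __init__(self, n_phones=40, sample_rate=16000):
        super().__init__()
        self.features = nn.Sequential(
            # First convolution spans only a few milliseconds of raw signal
            # (e.g. 30 samples at 16 kHz is roughly 2 ms); hypothetical values.
            nn.Conv1d(1, 80, kernel_size=30, stride=10),
            nn.MaxPool1d(3),
            nn.ReLU(),
            nn.Conv1d(80, 60, kernel_size=7),
            nn.MaxPool1d(3),
            nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(500),
            nn.ReLU(),
            nn.Linear(500, n_phones),  # phone class scores (softmax applied in the loss)
        )

    def forward(self, x):  # x: (batch, 1, n_samples) raw waveform
        return self.classifier(self.features(x))

# Example: score a 250 ms context window of raw 16 kHz speech.
model = RawSpeechCNN()
scores = model(torch.randn(8, 1, 4000))  # -> (8, n_phones)
```

The first convolution deliberately covers only a short span of the signal, mirroring the paper's observation that the early layers model short-term (2-4 ms) spectral envelope information; in a hybrid HMM/ANN system the softmax over these scores would provide the phone class conditional probabilities.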

[1] Yann LeCun, et al. Generalization and network design strategies, 1989.

[2] John Scott Bridle, et al. Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition, 1989, NATO Neurocomputing.

[3] Hsiao-Wuen Hon, et al. Speaker-independent phone recognition using hidden Markov models, 1989, IEEE Transactions on Acoustics, Speech, and Signal Processing.

[4] Hervé Bourlard, et al. Connectionist Speech Recognition: A Hybrid Approach, 1993.

[5] Herman J. M. Steeneken, et al. Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems, 1993, Speech Communication.

[6] Steve Young, et al. The HTK book, 1995.

[7] David Pearce, et al. The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions, 2000, INTERSPEECH.

[8] Yee Whye Teh, et al. A Fast Learning Algorithm for Deep Belief Nets, 2006, Neural Computation.

[9] Honglak Lee, et al. Unsupervised feature learning for audio classification using convolutional deep belief networks, 2009, NIPS.

[10] Clément Farabet, et al. Torch7: A Matlab-like Environment for Machine Learning, 2011, NIPS Workshop.

[11] Geoffrey E. Hinton, et al. Learning a better representation of speech soundwaves using restricted Boltzmann machines, 2011, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12] Dong Yu, et al. Conversational Speech Transcription Using Context-Dependent Deep Neural Networks, 2012, ICML.

[13] Tara N. Sainath, et al. Fundamental Technologies in Modern Speech Recognition, 2012, IEEE Signal Processing Magazine. DOI: 10.1109/MSP.2012.2205597.

[14] Dong Yu, et al. Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition, 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[15] Gerald Penn, et al. Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition, 2012, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16] Geoffrey E. Hinton, et al. Acoustic Modeling Using Deep Belief Networks, 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[17] Dimitri Palaz, et al. Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks, 2013, INTERSPEECH.

[18] Tara N. Sainath, et al. Deep convolutional neural networks for LVCSR, 2013, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19] Tara N. Sainath, et al. Learning filter banks within a deep neural network framework, 2013, IEEE Workshop on Automatic Speech Recognition and Understanding.

[20] Dimitri Palaz, et al. End-to-end Phoneme Sequence Recognition using Convolutional Neural Networks, 2013, arXiv.

[21] Dimitrios Dimitriadis, et al. Investigating deep neural network based transforms of robust audio features for LVCSR, 2013, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[22] Steve Renals, et al. Convolutional Neural Networks for Distant Speech Recognition, 2014, IEEE Signal Processing Letters.

[23] Hermann Ney, et al. Acoustic modeling with deep neural networks using raw time signal for LVCSR, 2014, INTERSPEECH.

[24] Dimitri Palaz, et al. Convolutional Neural Networks-based continuous speech recognition using raw speech signal, 2015, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[25] Léon Bottou. Stochastic Gradient Learning in Neural Networks, 1991.