Raw Speech Signal-based Continuous Speech Recognition using Convolutional Neural Networks

Abstract

State-of-the-art automatic speech recognition systems model the relationship between the acoustic speech signal and phone classes in two stages: extraction of spectral-based features based on prior knowledge, followed by training of an acoustic model, typically an artificial neural network (ANN). In a recent work, it was shown that convolutional neural networks (CNNs) are capable of modeling the relation between the acoustic speech signal and phone classes directly. This paper extends the CNN-based approach to a large vocabulary speech recognition task. More precisely, we compare the CNN-based approach against the conventional ANN-based approach on the Wall Street Journal corpus. Our studies show that the CNN-based approach with fewer parameters achieves performance comparable to or better than the conventional ANN-based approach.

1 Introduction

State-of-the-art automatic speech recognition (ASR) systems typically divide the task into several sub-tasks, which are optimized in an independent manner [1]. In a first step, the data is transformed into features, usually through a dimensionality reduction phase and an information selection phase based on task-specific knowledge of the phenomena. These two phases have been carefully hand-crafted, leading to state-of-the-art features such as mel frequency cepstral coefficients (MFCCs) [2] or perceptual linear prediction cepstral features (PLPs) [3]. In a second step, the likelihood of subword units, such as phonemes, is estimated using generative or discriminative models. In a final step, dynamic programming techniques are used to recognize the word sequence given the lexical and syntactical constraints.

Recent advances in machine learning have made possible systems that can be trained in an end-to-end manner, i.e. systems where every step is learned simultaneously, taking into account all the other steps and the final task of the whole system. This is typically referred to as deep learning, mainly because such architectures are usually composed of many layers (supposed to provide an increasing level of abstraction), compared to classical "shallow" systems. As opposed to the "divide and conquer" approaches presented previously (where each step is optimized independently), deep learning approaches are often claimed to have the potential to lead to more optimal systems, and to have the advantage of alleviating the need to find the right features for a given task of interest. While such approaches have a good success record in the computer vision [4] and text processing [5] fields, deep learning approaches for speech recognition still rely on spectral-based features such as MFCCs [6]. Some systems have proposed to learn features from "intermediate" representations of speech, such as mel filter bank energies and their temporal derivatives.

In a recent study [7], it was shown that it is possible to estimate phoneme class conditional probabilities by using the raw speech signal as input to convolutional neural networks (CNNs) [8]. On the TIMIT phoneme recognition task, it was shown that the system is able to learn features from the raw speech signal.
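To make the modeling setup concrete, the sketch below shows the general shape of such a raw-speech CNN: a convolutional stage that operates directly on waveform samples in place of hand-crafted MFCC/PLP extraction, followed by a classifier that outputs phoneme class conditional probabilities. This is a minimal illustration written in PyTorch for readability (the original work used the Torch7 environment); all kernel widths, layer sizes, window lengths, and the number of phoneme classes are hypothetical placeholders, not the configuration evaluated in this paper.

```python
import torch
import torch.nn as nn

class RawSpeechCNN(nn.Module):
    """Sketch: estimate phoneme class posteriors from raw speech samples."""

    def __init__(self, n_phone_classes: int = 40):  # class count is a placeholder
        super().__init__()
        # Convolutional stage: learns a filterbank-like analysis directly
        # from the waveform, replacing hand-crafted feature extraction.
        self.features = nn.Sequential(
            nn.Conv1d(1, 80, kernel_size=250, stride=10),  # frame-level analysis
            nn.MaxPool1d(3),
            nn.Tanh(),
            nn.Conv1d(80, 60, kernel_size=5),
            nn.MaxPool1d(2),
            nn.Tanh(),
        )
        # Classification stage: a small MLP producing per-window scores that
        # are normalized into class conditional probabilities.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(500),  # avoids hard-coding the conv output size
            nn.Tanh(),
            nn.Linear(500, n_phone_classes),
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, 1, n_samples), e.g. a 250 ms window of 16 kHz speech
        scores = self.classifier(self.features(waveform))
        # Log posteriors per window of raw samples.
        return torch.log_softmax(scores, dim=-1)

if __name__ == "__main__":
    model = RawSpeechCNN()
    window = torch.randn(8, 1, 4000)  # batch of 8 windows, 4000 samples each
    print(model(window).shape)        # torch.Size([8, 40])
```

In a hybrid recognizer, the per-window log posteriors from such a network would be converted to scaled likelihoods and decoded with the usual dynamic programming machinery under lexical and syntactical constraints, so only the feature extraction and acoustic modeling stages change.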

References

[1] Dong Yu et al., "Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition," IEEE Transactions on Audio, Speech, and Language Processing, 2012.

[2] Hervé Bourlard et al., "Connectionist Speech Recognition: A Hybrid Approach," 1993.

[3] Yoshua Bengio et al., "Gradient-based learning applied to document recognition," Proceedings of the IEEE, 1998.

[4] Y. LeCun et al., "Learning methods for generic object recognition with invariance to pose and lighting," Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2004.

[5] Léon Bottou, "Stochastic Gradient Learning in Neural Networks," Proceedings of Neuro-Nîmes, 1991.

[6] Dimitri Palaz et al., "Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks," INTERSPEECH, 2013.

[7] Jason Weston et al., "Natural Language Processing (Almost) from Scratch," Journal of Machine Learning Research, 2011.

[8] Geoffrey E. Hinton et al., "Acoustic Modeling Using Deep Belief Networks," IEEE Transactions on Audio, Speech, and Language Processing, 2012.

[9] Steve Renals et al., "Convolutional Neural Networks for Distant Speech Recognition," IEEE Signal Processing Letters, 2014.

[10] Yee Whye Teh et al., "A Fast Learning Algorithm for Deep Belief Nets," Neural Computation, 2006.

[11] Yoshua Bengio, "A Connectionist Approach to Speech Recognition," International Journal of Pattern Recognition and Artificial Intelligence, 1993.

[12] Wu Chou et al., "Robust decision tree state tying for continuous speech recognition," IEEE Transactions on Speech and Audio Processing, 2000.

[13] John Scott Bridle, "Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition," NATO Neurocomputing, 1989.

[14] Dimitri Palaz et al., "End-to-end Phoneme Sequence Recognition using Convolutional Neural Networks," arXiv, 2013.

[15] Honglak Lee et al., "Unsupervised feature learning for audio classification using convolutional deep belief networks," NIPS, 2009.

[16] Gerald Penn et al., "Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012.

[17] Jason Weston et al., "A unified architecture for natural language processing: deep neural networks with multitask learning," ICML, 2008.

[18] Geoffrey E. Hinton et al., "Phoneme recognition using time-delay neural networks," IEEE Transactions on Acoustics, Speech, and Signal Processing, 1989.

[19] Clément Farabet et al., "Torch7: A Matlab-like Environment for Machine Learning," NIPS, 2011.

[20] Dimitrios Dimitriadis et al., "Investigating deep neural network based transforms of robust audio features for LVCSR," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.

[21] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," The Journal of the Acoustical Society of America, 1990.

[22] Yann LeCun, "Generalization and network design strategies," 1989.

[23] Tara N. Sainath et al., "Deep convolutional neural networks for LVCSR," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.

[24] Steve Young, "The HTK Book," 1995.

[25] Geoffrey E. Hinton et al., "Learning a better representation of speech soundwaves using restricted Boltzmann machines," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011.

[26] Stan Davis et al., "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences," 1980.

[27] Françoise Fogelman-Soulié et al., "Experiments with time delay networks and dynamic time warping for speaker independent isolated digits recognition," EUROSPEECH, 1989.

[28] Tara N. Sainath et al., "Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups," IEEE Signal Processing Magazine, 2012.

[29] H. Bourlard et al., "Links Between Markov Models and Multilayer Perceptrons," IEEE Transactions on Pattern Analysis and Machine Intelligence, 1990.

[30] Steve J. Young et al., "Large vocabulary continuous speech recognition using HTK," Proceedings of ICASSP '94, 1994.

[31] Dong Yu et al., "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks," ICML, 2012.