Quaternion Convolutional Neural Networks for End-to-End Automatic Speech Recognition

Recently, the connectionist temporal classification (CTC) model coupled with recurrent (RNN) or convolutional neural networks (CNN), made it easier to train speech recognition systems in an end-to-end fashion. However in real-valued models, time frame components such as mel-filter-bank energies and the cepstral coefficients obtained from them, together with their first and second order derivatives, are processed as individual elements, while a natural alternative is to process such components as composed entities. We propose to group such elements in the form of quaternions and to process these quaternions using the established quaternion algebra. Quaternion numbers and quaternion neural networks have shown their efficiency to process multidimensional inputs as entities, to encode internal dependencies, and to solve many tasks with less learning parameters than real-valued models. This paper proposes to integrate multiple feature views in quaternion-valued convolutional neural network (QCNN), to be used for sequence-to-sequence mapping with the CTC model. Promising results are reported using simple QCNNs in phoneme recognition experiments with the TIMIT corpus. More precisely, QCNNs obtain a lower phoneme error rate (PER) with less learning parameters than a competing model based on real-valued CNNs.

[1]  Soo-Chang Pei,et al.  Edge detection, color quantization, segmentation, texture removal, and noise reduction of color image using quaternion iterative filtering , 2014, J. Electronic Imaging.

[2]  Ying Zhang,et al.  Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks , 2016, INTERSPEECH.

[3]  Yoshua Bengio,et al.  Global optimization of a neural network-hidden Markov model hybrid , 1992, IEEE Trans. Neural Networks.

[4]  Luigi Fortuna,et al.  Neural networks for quaternion-valued function approximation , 1994, Proceedings of IEEE International Symposium on Circuits and Systems - ISCAS '94.

[5]  Geoffrey E. Hinton,et al.  Dynamic Routing Between Capsules , 2017, NIPS.

[6]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[7]  Geoffrey E. Hinton,et al.  Speech recognition with deep recurrent neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[8]  Tara N. Sainath,et al.  FUNDAMENTAL TECHNOLOGIES IN MODERN SPEECH RECOGNITION Digital Object Identifier 10.1109/MSP.2012.2205597 , 2012 .

[9]  Jürgen Schmidhuber,et al.  Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[10]  Jian Sun,et al.  Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[11]  Andrew W. Senior,et al.  Long short-term memory recurrent neural network architectures for large scale acoustic modeling , 2014, INTERSPEECH.

[12]  Nobuyuki Matsui,et al.  Feed forward neural network with random quaternionic neurons , 2017, Signal Process..

[13]  Gang Chen,et al.  Color Image Analysis by Quaternion-Type Moments , 2014, Journal of Mathematical Imaging and Vision.

[14]  Yoshua Bengio,et al.  Improving Speech Recognition by Revising Gated Recurrent Units , 2017, INTERSPEECH.

[15]  Hiromi Kusamichi,et al.  A New Scheme for Color Night Vision by Quaternion Neural Network , 2004 .

[16]  Anthony J. Robinson,et al.  An application of recurrent nets to phone probability estimation , 1994, IEEE Trans. Neural Networks.

[17]  Titouan Parcollet,et al.  Deep quaternion neural networks for spoken language understanding , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[18]  Liang Lu,et al.  Multitask Learning with CTC and Segmental CRF for Speech Recognition , 2017, INTERSPEECH.

[19]  Liang Lu,et al.  Segmental Recurrent Neural Networks for End-to-End Speech Recognition , 2016, INTERSPEECH.

[20]  T. Nitta,et al.  A quaternary version of the back-propagation algorithm , 1995, Proceedings of ICNN'95 - International Conference on Neural Networks.

[21]  Geoffrey E. Hinton,et al.  Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine , 2010, NIPS.

[22]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  M. Picheny,et al.  Comparison of Parametric Representation for Monosyllabic Word Recognition in Continuously Spoken Sentences , 2017 .

[24]  S. Furui,et al.  Speaker-independent isolated word recognition based on emphasized spectral dynamics , 1986, ICASSP '86. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[25]  Anthony S. Maida,et al.  Deep Quaternion Networks , 2017, 2018 International Joint Conference on Neural Networks (IJCNN).

[26]  Gerald Penn,et al.  Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[27]  Titouan Parcollet,et al.  Quaternion Denoising Encoder-Decoder for Theme Identification of Telephone Conversations , 2017, INTERSPEECH.

[28]  Jürgen Schmidhuber,et al.  LSTM: A Search Space Odyssey , 2015, IEEE Transactions on Neural Networks and Learning Systems.

[29]  Jonathan G. Fiscus,et al.  DARPA TIMIT:: acoustic-phonetic continuous speech corpus CD-ROM, NIST speech disc 1-1.1 , 1993 .

[30]  Yoshua Bengio,et al.  Understanding the difficulty of training deep feedforward neural networks , 2010, AISTATS.

[31]  Yoshua Bengio,et al.  End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results , 2014, ArXiv.

[32]  Anubhav Garg,et al.  Vector Sparse Representation of Color Image Using Quaternion Matrix Analysis based on Genetic Algorithm , 2017 .

[33]  Titouan Parcollet,et al.  Quaternion Neural Networks for Spoken Language Understanding , 2016, 2016 IEEE Spoken Language Technology Workshop (SLT).

[34]  Dongpo Xu,et al.  Learning Algorithms in Quaternion Neural Networks Using GHR Calculus , 2017 .

[35]  Sandeep Subramanian,et al.  Deep Complex Networks , 2017, ICLR.

[36]  Hervé Bourlard,et al.  A mew ASR approach based on independent processing and recombination of partial frequency bands , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.