论文信息 - CIF: Continuous Integrate-And-Fire for End-To-End Speech Recognition

CIF: Continuous Integrate-And-Fire for End-To-End Speech Recognition

In this paper, we propose a novel soft and monotonic alignment mechanism used for sequence transduction. It is inspired by the integrate-and-fire model in spiking neural networks and employed in the encoder-decoder framework consists of continuous functions, thus being named as: Continuous Integrate-and-Fire (CIF). Applied to the ASR task, CIF not only shows a concise calculation, but also supports online recognition and acoustic boundary positioning, thus suitable for various ASR scenarios. Several support strategies are also proposed to alleviate the unique problems of CIF-based model. With the joint action of these methods, the CIF-based model shows competitive performance. Notably, it achieves a word error rate (WER) of 2.86% on the test-clean of Librispeech and creates new state-of-the-art result on Mandarin telephone ASR benchmark.

Linhao Dong | Bo Xu | Bo Xu | Linhao Dong

[1] Wonyong Sung,et al. Fully Neural Network Based Speech Recognition on Mobile and Embedded Devices , 2018, NeurIPS.

[2] Daniel Povey,et al. The Kaldi Speech Recognition Toolkit , 2011 .

[3] Jürgen Schmidhuber,et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[4] Wolfgang Maass,et al. Networks of Spiking Neurons: The Third Generation of Neural Network Models , 1996, Electron. Colloquium Comput. Complex..

[5] Samy Bengio,et al. Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks , 2015, NIPS.

[6] Navdeep Jaitly,et al. Towards Better Decoding and Language Model Integration in Sequence to Sequence Models , 2016, INTERSPEECH.

[7] Alex Graves,et al. Sequence Transduction with Recurrent Neural Networks , 2012, ArXiv.

[8] Matt Shannon,et al. Recurrent Neural Aligner: An Encoder-Decoder Neural Network Model for Sequence to Sequence Mapping , 2017, INTERSPEECH.

[9] Sanjeev Khudanpur,et al. Audio augmentation for speech recognition , 2015, INTERSPEECH.

[10] Alex Graves,et al. Video Pixel Networks , 2016, ICML.

[11] Samy Bengio,et al. An Online Sequence-to-Sequence Model Using Partial Conditioning , 2015, NIPS.

[12] Wofgang Maas,et al. Networks of spiking neurons: the third generation of neural network models , 1997 .

[13] Colin Raffel,et al. Monotonic Chunkwise Attention , 2017, ICLR.

[14] Gabriel Synnaeve,et al. Letter-Based Speech Recognition with Gated ConvNets , 2017, ArXiv.

[15] Boris Ginsburg,et al. Jasper: An End-to-End Convolutional Neural Acoustic Model , 2019, INTERSPEECH.

[16] Shiliang Zhang,et al. Gaussian Prediction Based Attention for Online End-to-End Speech Recognition , 2017, INTERSPEECH.

[17] Ronan Collobert,et al. Wav2Letter++: A Fast Open-source Speech Recognition System , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[18] Peter W. Jusczyk,et al. How infants begin to extract words from speech , 1999, Trends in Cognitive Sciences.

[19] Rico Sennrich,et al. Neural Machine Translation of Rare Words with Subword Units , 2015, ACL.

[20] Shinji Watanabe,et al. ESPnet: End-to-End Speech Processing Toolkit , 2018, INTERSPEECH.

[21] Luke S. Zettlemoyer,et al. Transformers with convolutional context for ASR , 2019, ArXiv.

[22] Pascale Fung,et al. HKUST/MTS: A Very Large Scale Mandarin Telephone Speech Corpus , 2006, ISCSLP.

[23] Satoshi Nakamura,et al. Local Monotonic Attention Mechanism for End-to-End Speech And Language Processing , 2017, IJCNLP.

[24] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.

[25] Jonathan Le Roux,et al. Triggered Attention for End-to-end Speech Recognition , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[26] Mohan Li,et al. End-to-end Speech Recognition with Adaptive Computation Steps , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[27] Anthony N. Burkitt,et al. A Review of the Integrate-and-fire Neuron Model: I. Homogeneous Synaptic Input , 2006, Biological Cybernetics.

[28] Quoc V. Le,et al. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[29] Yoshua Bengio,et al. Attention-Based Models for Speech Recognition , 2015, NIPS.

[30] Sanjeev Khudanpur,et al. Librispeech: An ASR corpus based on public domain audio books , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[31] Hui Bu,et al. AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale , 2018, ArXiv.

[32] Gang Liu,et al. An Online Attention-based Model for Speech Recognition , 2018, INTERSPEECH.

[33] Shinji Watanabe,et al. Joint CTC-attention based end-to-end speech recognition using multi-task learning , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[34] Martín Abadi,et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems , 2016, ArXiv.

[35] Wei Chen,et al. Extending Recurrent Neural Aligner for Streaming End-to-End Speech Recognition in Mandarin , 2018, INTERSPEECH.

[36] Richard Socher,et al. Improving End-to-End Speech Recognition with Policy Learning , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[37] L. F Abbott,et al. Lapicque’s introduction of the integrate-and-fire model neuron (1907) , 1999, Brain Research Bulletin.

[38] Bo Xu,et al. Self-attention Aligner: A Latency-control End-to-end Model for ASR Using Self-attention Network and Chunk-hopping , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[39] Hermann Ney,et al. An Analysis of Local Monotonic Attention Variants , 2019, INTERSPEECH.

[40] Gabriel Synnaeve,et al. Wav2Letter++: A Fast Open-source Speech Recognition System , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[41] Shuang Xu,et al. A Comparison of Modeling Units in Sequence-to-Sequence Speech Recognition with the Transformer on Mandarin Chinese , 2018, ICONIP.

[42] Zhiheng Huang,et al. Self-attention Networks for Connectionist Temporal Classification in Speech Recognition , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[43] Hermann Ney,et al. RWTH ASR Systems for LibriSpeech: Hybrid vs Attention - w/o Data Augmentation , 2019, INTERSPEECH.

[44] Tara N. Sainath,et al. State-of-the-Art Speech Recognition with Sequence-to-Sequence Models , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[45] Hermann Ney,et al. Improved training of end-to-end attention models for speech recognition , 2018, INTERSPEECH.

[46] Quoc V. Le,et al. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition , 2019, INTERSPEECH.

[47] Geoffrey E. Hinton,et al. Layer Normalization , 2016, ArXiv.

[48] Yiming Wang,et al. Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI , 2016, INTERSPEECH.