Wav2Letter++: A Fast Open-source Speech Recognition System

This paper introduces wav2letter++, a fast open-source deep learning speech recognition framework. wav2letter++ is written entirely in C++, and uses the ArrayFire tensor library for maximum efficiency. We explain the architecture and design of the wav2letter++ system and compare it to other major open-source speech recognition systems. In some cases wav2letter++ is more than 2× faster than other optimized frameworks for training end-to-end neural networks for speech recognition. We also show that wav2letter++ training times scale linearly to 64 GPUs, the most we tested, for models with 100 million parameters. High-performance frameworks enable fast iteration, which is often a crucial factor in successful research and model tuning on new datasets and tasks.

Gabriel Synnaeve | Ronan Collobert | Vineel Pratap | Qiantong Xu | Awni Hannun | Jacob Kahn | Vitaliy Liptchinsky | Jeff Cai

[1] John Tran, et al. cuDNN: Efficient Primitives for Deep Learning, 2014, ArXiv.

[2] Jack Dongarra, et al. Special Issue on Program Generation, Optimization, and Platform Adaptation, 2005, Proceedings of the IEEE.

[3] Gabriel Synnaeve, et al. Wav2Letter: an End-to-End ConvNet-based Speech Recognition System, 2016, ArXiv.

[4] Steven G. Johnson, et al. The Design and Implementation of FFTW3, 2005, Proceedings of the IEEE.

[5] Jürgen Schmidhuber, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, 2006, ICML.

[6] Janet M. Baker, et al. The Design for the Wall Street Journal-based CSR Corpus, 1992, HLT.

[7] Kenta Oono, et al. Chainer: a Next-Generation Open Source Framework for Deep Learning, 2015.

[8] Yuan Yu, et al. TensorFlow: A system for large-scale machine learning, 2016, OSDI.

[9] Krunal Patel, et al. ArrayFire: a GPU acceleration platform, 2012, Defense, Security, and Sensing.

[10] Yoshua Bengio, et al. Attention-Based Models for Speech Recognition, 2015, NIPS.

[11] Erich Elsen, et al. Deep Speech: Scaling up end-to-end speech recognition, 2014, ArXiv.

[12] Gabriel Synnaeve, et al. Letter-Based Speech Recognition with Gated ConvNets, 2017, ArXiv.

[13] Luca Antiga, et al. Automatic differentiation in PyTorch, 2017.

[14] Kenneth Heafield, et al. KenLM: Faster and Smaller Language Model Queries, 2011, WMT@EMNLP.

[15] Yoshua Bengio, et al. Neural Machine Translation by Jointly Learning to Align and Translate, 2014, ICLR.

[16] Yiming Wang, et al. Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI, 2016, INTERSPEECH.

[17] Daniel Povey, et al. The Kaldi Speech Recognition Toolkit, 2011.

[18] Boris Ginsburg, et al. Mixed-Precision Training for NLP and Speech Recognition with OpenSeq2Seq, 2018, arXiv:1805.10387.

[19] Shinji Watanabe, et al. ESPnet: End-to-End Speech Processing Toolkit, 2018, INTERSPEECH.

[20] Boris Ginsburg, et al. OpenSeq2Seq: Extensible Toolkit for Distributed and Mixed Precision Training of Sequence-to-Sequence Models, 2018, ArXiv.

[21] Yajie Miao, et al. EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding, 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).