Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling

Lingvo is a TensorFlow framework offering a complete solution for collaborative deep learning research, with a particular focus on sequence-to-sequence models. Lingvo models are composed of modular building blocks that are flexible and easily extensible, and experiment configurations are centralized and highly customizable. Distributed training and quantized inference are supported directly within the framework, and it provides implementations of a large number of utilities, helper functions, and recent research ideas. Lingvo has been used in collaboration by dozens of researchers in more than 20 papers over the last two years. This document outlines the underlying design of Lingvo and serves as an introduction to the various pieces of the framework, while also offering examples of advanced features that showcase its capabilities.
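
As a rough illustration of how these modular building blocks and centralized configurations fit together, the sketch below shows a layer written against Lingvo's BaseLayer/Params pattern (lingvo.core.base_layer and lingvo.core.layers). This is a minimal sketch, not code from the paper: the layer name SimpleProjection and its hyperparameters are hypothetical, and exact API details may differ across Lingvo versions.

# A minimal sketch of a Lingvo-style layer; SimpleProjection, input_dim, and
# hidden_dim are illustrative names, not part of the Lingvo codebase.
from lingvo.core import base_layer
from lingvo.core import layers


class SimpleProjection(base_layer.BaseLayer):
  """Wraps a fully-connected child layer behind a Params configuration."""

  @classmethod
  def Params(cls):
    # Hyperparameters are declared once here; experiment configurations
    # override them centrally instead of editing model code.
    p = super().Params()
    p.Define('input_dim', 0, 'Dimension of the input features.')
    p.Define('hidden_dim', 0, 'Dimension of the projected output.')
    return p

  def __init__(self, params):
    super().__init__(params)
    p = self.params
    # Child layers are configured through their own Params objects, which is
    # what keeps the building blocks modular and easy to swap out.
    proj_p = layers.FCLayer.Params().Set(
        input_dim=p.input_dim, output_dim=p.hidden_dim)
    self.CreateChild('proj', proj_p)

  def FProp(self, theta, inputs):
    # `theta` carries the layer's (and its children's) variables explicitly.
    return self.proj.FProp(theta.proj, inputs)

Because every layer exposes its hyperparameters through Params, an experiment configuration can assemble and tune an entire model by composing and overriding these objects in one place, which is the centralization the abstract refers to.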

Tara N. Sainath | Ankur Bapna | Navdeep Jaitly | Yuan Cao | Patrick Nguyen | Yonghui Wu | Orhan Firat | Melvin Johnson | Wolfgang Macherey | Zhifeng Chen | Ye Jia | George F. Foster | Ron J. Weiss | Akiko Eriguchi | Rohit Prabhavalkar | Qiao Liang | Colin Cherry | Kuan-Chieh Wang | Shuyuan Zhang | Jan Chorowski | Sébastien Jean | Parisa Haghani | Bo Li | Ciprian Chelba | Suyog Gupta | Dehao Chen | Chung-Cheng Chiu | Anjuli Kannan | Ekaterina Gonina | Mike Schuster | Kazuki Irie | Yanping Huang | HyoukJoong Lee | Ruoming Pang | Isaac Caswell | John Richardson | Xiaobing Liu | Wei-Ning Hsu | Bowen Liang | Yanzhang He | Rohan Anil | Katrin Tomanek | Zongheng Yang | Zelin Wu | Llion Jones | Raziel Alvarez | Naveen Ari | Stella Laurenzo | Youlong Cheng | Jonathan Shen | James Qin | Otavio Good | Mia X. Chen | Smit Hinsu | Benoit Jacob | Rajat Tibrewal | Ben Vanik | Colin Raffel | Kanishka Rao | H. Zen | M. Bacchiani | Dmitry Lepikhin | M. Krikun | S. Sabour | Qi Ge | William Chan | David Rybach | Ian McGraw | Vijayaditya Peddinti | Klaus Macherey | G. Pundak | Semih Yavuz | Chad Whipkey | Pat Rondon | M. Nirschl | Shankar Kumar | Uri Alon | A. Bruguier | Shubham Toshniwal | Yu Zhang | Benjamin Lee | Ye Tian | Deepti Bhatia | Justin Carlson | Robert Suderman | Ian Williams | M. Murray | T. Jablin | M. Galkin | Todd Wang | Baohua Liao
