SVitchboard-II and FiSVer-I: Crafting high quality and low complexity conversational English speech corpora using submodular function optimization

Abstract We introduce a set of benchmark corpora of conversational English speech derived from the Switchboard-I and Fisher datasets. Traditional automatic speech recognition (ASR) research requires considerable computational resources and has slow experimental turnaround times. Our goal is to introduce these new datasets to researchers in the ASR and machine learning communities in order to facilitate the development of novel speech recognition techniques on smaller but still acoustically rich, diverse, and hence interesting corpora. We select these corpora to maximize an acoustic quality criterion while limiting the vocabulary size (from 10 words up to 10,000 words), where both “acoustic quality” and vocabulary size are measured via various submodular functions. We also survey numerous submodular functions that could be useful for measuring both “acoustic quality” and “corpus complexity”, and offer guidelines on when and why a scientist may wish to use one versus another. The corpus selection process itself is performed using various state-of-the-art submodular function optimization procedures, including submodular level-set constrained submodular optimization (SCSC/SCSK), difference-of-submodular (DS) optimization, and unconstrained submodular minimization (SFM), all of which are fully defined herein. While the focus of this paper is on the resultant speech corpora and on the survey of possible objectives, a consequence of the paper is a thorough empirical comparison of the relative merits of these modern submodular optimization procedures. We provide baseline word recognition results on all of the resultant speech corpora for both Gaussian mixture model (GMM) and deep neural network (DNN)-based systems, and we have released all of the corpora definitions and Kaldi training recipes for free in the public domain.
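To make the selection problem concrete, the following is a minimal sketch of choosing utterances that maximize a submodular "quality" score while keeping the number of distinct words under a budget. It is not the SCSC/SCSK, DS, or SFM procedure used in the paper; it is a simple cost-ratio greedy heuristic shown only for illustration. The data layout (`words`, `feats`), the square-root feature-based quality function, and all function names are assumptions introduced here.

```python
import math

def vocab_size(selected, utterances):
    """Distinct words used by the selected utterances (a set-cover function, hence submodular)."""
    words = set()
    for i in selected:
        words |= utterances[i]["words"]
    return len(words)

def quality(selected, utterances):
    """Hypothetical quality score: sum over features of sqrt(accumulated count).
    Concave-over-modular, hence monotone submodular."""
    totals = {}
    for i in selected:
        for feat, count in utterances[i]["feats"].items():
            totals[feat] = totals.get(feat, 0.0) + count
    return sum(math.sqrt(v) for v in totals.values())

def greedy_select(utterances, vocab_budget):
    """Repeatedly add the feasible utterance with the best quality gain per new word."""
    selected = []
    remaining = set(range(len(utterances)))
    while remaining:
        base_q = quality(selected, utterances)
        base_v = vocab_size(selected, utterances)
        best, best_ratio = None, 0.0
        for i in sorted(remaining):
            cand = selected + [i]
            new_v = vocab_size(cand, utterances)
            if new_v > vocab_budget:
                continue  # adding this utterance would exceed the vocabulary budget
            gain = quality(cand, utterances) - base_q
            cost = new_v - base_v
            ratio = gain / max(cost, 1e-9)  # utterances adding no new words are nearly free
            if ratio > best_ratio:
                best, best_ratio = i, ratio
        if best is None:
            break  # nothing feasible improves the objective
        selected.append(best)
        remaining.discard(best)
    return selected

# Toy data (invented for illustration only).
toy = [
    {"words": {"yes", "no"}, "feats": {"a": 9}},
    {"words": {"yes"}, "feats": {"b": 1}},
    {"words": {"hello", "world", "yes", "no", "maybe"}, "feats": {"c": 9}},
]
chosen = greedy_select(toy, vocab_budget=2)
print(chosen, vocab_size(chosen, toy))
```

The ratio rule favors utterances that add acoustic diversity without introducing new words, which is the trade-off the abstract describes; the procedures named in the abstract are the ones actually compared in the paper.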
