Paper Information - Optimal Selection of Limited Vocabulary Speech Corpora

Optimal Selection of Limited Vocabulary Speech Corpora

We address the problem of finding a subset of a large speech data corpus that is useful for accurately and rapidly prototyping novel and computationally expensive speech recognition architectures. To solve this problem, we express it as an optimization problem over submodular functions. Quantities such as the vocabulary size (or quality) of a set of utterances, or the quality of a bundle of word types, are submodular functions, which makes finding optimal solutions possible. Moreover, we are able to express our approach using graph cuts, leading to a very fast implementation even on large initial corpora. We show results on the Switchboard-I corpus, demonstrating improved results over previous techniques for this purpose. We also demonstrate the variety of the resulting corpora that may be produced using our method.
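The abstract's claim that the vocabulary size of a set of utterances is submodular can be checked on a toy example. The sketch below is only an illustration under stated assumptions: the four-utterance corpus and the names `UTTERANCES`, `vocab_size`, and `is_submodular` are hypothetical and do not come from the paper, and the brute-force check is not the graph-cut method the authors describe. It verifies the diminishing-returns property: adding an utterance to a smaller set grows the vocabulary at least as much as adding it to a larger superset.

```python
from itertools import combinations

# Hypothetical toy corpus: utterance id -> set of word types it contains.
UTTERANCES = {
    "u1": {"yeah", "i", "know"},
    "u2": {"i", "know", "right"},
    "u3": {"oh", "yeah"},
    "u4": {"right", "okay"},
}


def vocab_size(subset):
    """Number of distinct word types covered by the utterances in `subset`."""
    words = set()
    for u in subset:
        words |= UTTERANCES[u]
    return len(words)


def powerset(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def is_submodular(f, ground):
    """Brute-force diminishing-returns check.

    For every A that is a subset of B and every u outside B, require
    f(A + u) - f(A) >= f(B + u) - f(B).
    Exponential in the ground-set size, so only suitable for toy inputs.
    """
    ground = frozenset(ground)
    for b in powerset(ground):
        for a in powerset(b):
            for u in ground - b:
                gain_a = f(a | {u}) - f(a)
                gain_b = f(b | {u}) - f(b)
                if gain_a < gain_b:
                    return False
    return True


if __name__ == "__main__":
    print(is_submodular(vocab_size, UTTERANCES))
```

For contrast, a function such as `lambda s: len(s) ** 2` fails the same check, because its marginal gains grow rather than shrink as the set gets larger.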

Hui Lin | Jeff A. Bilmes
