A comparison of the data requirements of automatic speech recognition systems and human listeners

Since the introduction of hidden Markov modelling, there has been an increasing emphasis on data-driven approaches to automatic speech recognition. This stems from the fact that systems trained on substantial corpora readily outperform those that rely more heavily on phonetic or linguistic priors. Likewise, extra training data almost always reduces word error rate: “there's no data like more data”. Despite this progress, however, contemporary systems cannot meet the requirements of many potential applications, and their performance still falls significantly short of that of human listeners. For these reasons, the R&D community continues to call for ever greater quantities of data with which to train its systems. This paper addresses the question of just how much data might be required to bring the performance of an automatic speech recognition system up to that of a human listener.
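The kind of extrapolation the abstract alludes to can be sketched numerically: if word error rate falls roughly as a power law in the amount of training data, a log-log fit can be solved for the data volume at which the curve would reach a human-like error rate. The data points, the power-law assumption, and the 1% target below are all hypothetical, chosen only to illustrate the arithmetic, not taken from the paper.

```python
import math

def extrapolate_hours(hours, wers, target_wer):
    """Fit a power law WER = a * hours^b (a straight line in log-log
    space, via least squares) and solve for the training hours at which
    the fitted curve reaches target_wer. Illustrative sketch only."""
    xs = [math.log(h) for h in hours]
    ys = [math.log(w) for w in wers]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    log_a = my - b * mx
    # Solve target_wer = a * h^b  =>  h = exp((ln(target) - ln(a)) / b)
    return math.exp((math.log(target_wer) - log_a) / b)

# Hypothetical operating points: WER halves for every tenfold
# increase in training data.
hours = [10, 100, 1000]
wers = [30.0, 15.0, 7.5]

# Hours implied by the fit for a human-like 1% WER (~8 x 10^5 here).
needed = extrapolate_hours(hours, wers, 1.0)
```

Under these made-up numbers the fit implies several hundred thousand hours of training speech, illustrating how quickly a diminishing-returns curve inflates the data requirement as the target approaches human performance.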
