The Influence of Errors in Phonetic Annotations on Performance of Speech Recognition System

This paper deals with errors in acoustic training data and their influence on speech recognition performance. The training data can be prepared manually, automatically, or by a combination of the two. In all cases, some mislabeled phonemes can appear in the phonetic annotations. We conducted a series of experiments that simulate some common errors. The experiments apply varying amounts of changes to the phonetic annotations: different types of changes in the voicing of obstruents, random substitution of consonants or vowels, and random deletion of phonemes. All experiments were done for the Czech language using the GlobalPhone speech data set, and both Gaussian mixture models and deep neural networks were used for acoustic modeling. The results show that a certain amount of such errors in the training data does not influence speech recognition accuracy. The accuracy is significantly influenced only by a large amount of errors (more than 50%).
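The kinds of simulated annotation errors named in the abstract (voicing changes, random substitution within consonants or vowels, random deletion) can be illustrated with a minimal sketch. This is not the authors' procedure: the function `corrupt`, its parameters, and the small phoneme sets are hypothetical and chosen only for illustration, and the per-phoneme error probability `rate` is an assumption about how the "amount of changes" might be controlled.

```python
import random

# Hypothetical, simplified phoneme classes (NOT the Czech inventory used in the paper).
VOWELS = ["a", "e", "i", "o", "u"]
CONSONANTS = ["p", "b", "t", "d", "k", "g", "s", "z", "f", "v"]
# Voiced/voiceless obstruent pairs used for the voicing swap.
VOICING_PAIRS = {"p": "b", "b": "p", "t": "d", "d": "t", "k": "g",
                 "g": "k", "s": "z", "z": "s", "f": "v", "v": "f"}
MODES = ("voicing", "substitute", "delete")


def corrupt(phonemes, rate, mode, seed=0):
    """Return a copy of `phonemes` in which each label is altered with probability `rate`."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    rng = random.Random(seed)  # seeded so that a corrupted training set is reproducible
    out = []
    for p in phonemes:
        if rng.random() >= rate:
            out.append(p)  # label left untouched
        elif mode == "delete":
            continue  # drop the phoneme from the annotation
        elif mode == "voicing":
            out.append(VOICING_PAIRS.get(p, p))  # labels without a pair stay as they are
        else:  # "substitute": replace with a different label from the same class
            pool = VOWELS if p in VOWELS else CONSONANTS
            out.append(rng.choice([q for q in pool if q != p]))
    return out
```

For example, `corrupt(["p", "a", "t"], rate=1.0, mode="voicing")` flips the voicing of both obstruents and leaves the vowel unchanged, while `rate=0.5` would alter roughly half of the labels on average.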
