On the Design of an Automatic Speech Recognition System for Romanian Language

For decades, engineers and scientists have studied speech and its production with the aim of creating more effective and efficient systems for human-computer interaction. This paper presents an extensive set of experiments carried out to build an automatic speech recognition (ASR) system for spoken Romanian connected digits. State-of-the-art hidden Markov models (HMMs) are used as acoustic models, together with a finite-state grammar language model, to build and optimize a fully functional digit recognizer for the Romanian language. Speech recognition has many applications in daily life, ranging from niche uses such as medical interfaces and industrial command-and-control systems to consumer products such as the speech interfaces offered by modern phone operating systems. The rich mathematical framework of HMMs makes statistical approaches well suited to this task, and one goal of this paper is to confirm the validity and reproducibility of this method. Another objective is to integrate the components and toolkits needed to build a recognition system and to briefly describe the processes involved in speech representation, the underlying mathematics, and the analysis used to improve and optimize the primary evaluation metrics. The results show the advantage of training on a larger speaker database in order to obtain a speaker-independent recognizer: word error rate (WER) improves by more than 60% compared to a speaker-dependent model when evaluated on a 90-speaker database. The implementation of the system and the experiments are provided, along with the evaluation results for decoding and optimization.
