Utilizing data-driven and knowledge-based techniques to enhance Arabic speech recognition

Pronunciation variation is a well-known phenomenon that degrades the performance of speech recognition systems. It occurs mainly in two forms: within-word pronunciation variation and cross-word pronunciation variation. Within-word variation occurs inside a single word, while cross-word variation arises when two successive words interact, changing the pronunciation of one or two letters at the word boundary; furthermore, the two words may merge into one continuous utterance with no clear boundary between them. In speech recognition, both forms alter the phonetic spelling of words beyond the forms listed in the pronunciation dictionary, producing a number of out-of-vocabulary word forms and consequently reducing recognition performance. Pronunciation variation problems can also surface as an incorrectly recognized word sequence whose syntactic structure is invalid in the language.

In this thesis we propose knowledge-based and data-driven techniques to address these three problems (within-word variation, cross-word variation, and syntactically ill-formed word sequences). The proposed methods were investigated on a Modern Standard Arabic speech recognition system built with the Carnegie Mellon University Sphinx speech recognition engine.

The first problem (within-word variation) was modeled with a data-driven approach that uses a dynamic programming method (phoneme sequence alignment) to distill variants from the pronunciation corpus. This technique achieved a significant improvement of 1.82% over the baseline system.

The second problem (cross-word variation) was modeled along three tracks: a knowledge-based approach using Arabic phonological rules, a knowledge-based approach using part-of-speech tagging, and a data-driven approach that merges small words. All three tracks achieved significant improvements over the baseline system: the part-of-speech tagging approach achieved the highest improvement (2.39%), followed by the phonological rules approach (2.30%) and the merging of small words (2.16%).

The third problem was modeled with a data mining algorithm that extracts the best language syntax rules, which are later used to rescore the N-best hypotheses; the Stanford Arabic tagger was used for the tagging process. This method, however, did not lead to a significant improvement.
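
As a rough illustration of the data-driven within-word technique, the sketch below aligns a canonical phoneme sequence with an observed one using dynamic programming and reads substitutions, insertions, and deletions off the alignment. This is a minimal sketch, not the thesis implementation; the phoneme symbols, alignment costs, and example word are assumptions introduced here for illustration.

# Minimal sketch (assumed details, not the thesis code): dynamic programming
# alignment of a canonical phoneme sequence against an observed one, from
# which pronunciation variants can be read off.

def align(reference, observed, sub_cost=1, gap_cost=1):
    """Global alignment over two phoneme lists.

    Returns a list of (ref_phone, obs_phone) pairs; None marks a gap.
    """
    n, m = len(reference), len(observed)
    # dp[i][j] = minimal cost of aligning reference[:i] with observed[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap_cost
    for j in range(1, m + 1):
        dp[0][j] = j * gap_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = dp[i - 1][j - 1] + (0 if reference[i - 1] == observed[j - 1] else sub_cost)
            dp[i][j] = min(match, dp[i - 1][j] + gap_cost, dp[i][j - 1] + gap_cost)
    # Trace back to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (0 if reference[i - 1] == observed[j - 1] else sub_cost):
            pairs.append((reference[i - 1], observed[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + gap_cost:
            pairs.append((reference[i - 1], None))   # phoneme dropped in the observed form
            i -= 1
        else:
            pairs.append((None, observed[j - 1]))    # phoneme inserted in the observed form
            j -= 1
    return list(reversed(pairs))


if __name__ == "__main__":
    # Hypothetical canonical vs. observed pronunciation of one word.
    canonical = ["k", "i", "t", "aa", "b"]
    observed = ["k", "t", "aa", "b"]          # vowel dropped in fast speech
    for ref, obs in align(canonical, observed):
        print(ref or "-", "->", obs or "-")

Variants collected this way across the pronunciation corpus could then be added to the dictionary as alternative entries for the affected words.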
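The N-best rescoring step of the third method could look roughly like the following: frequent part-of-speech bigrams are mined from a tagged corpus as "syntax rules", and each hypothesis is rescored by how many of its tag bigrams match a rule. The function names, weighting scheme, and data format are assumptions made for this sketch; the thesis used the Stanford Arabic tagger for the tagging step.

# Minimal sketch (assumed details): mining POS-bigram rules and using them
# to rescore an N-best list of recognition hypotheses.
from collections import Counter

def mine_pos_bigrams(tagged_sentences, min_count=5):
    """Keep POS-tag bigrams frequent enough to count as syntax rules."""
    counts = Counter()
    for tags in tagged_sentences:
        counts.update(zip(tags, tags[1:]))
    return {bigram for bigram, c in counts.items() if c >= min_count}

def rule_score(tags, rules):
    """Fraction of a hypothesis's POS bigrams that match a mined rule."""
    bigrams = list(zip(tags, tags[1:]))
    if not bigrams:
        return 0.0
    return sum(b in rules for b in bigrams) / len(bigrams)

def rescore(nbest, rules, weight=0.5):
    """Combine the recognizer score with the rule score and pick the best.

    `nbest` is a list of (recognizer score, POS-tag sequence, text) tuples.
    """
    return max(nbest, key=lambda h: h[0] + weight * rule_score(h[1], rules))


if __name__ == "__main__":
    # Toy tagged corpus and a 2-best list (scores and tags are made up).
    corpus = [["DT", "NN", "VB"], ["DT", "NN", "VB", "NN"], ["DT", "NN", "VB"]]
    rules = mine_pos_bigrams(corpus, min_count=2)
    nbest = [(-12.0, ["NN", "DT", "VB"], "hyp 1"),
             (-12.5, ["DT", "NN", "VB"], "hyp 2")]
    print(rescore(nbest, rules, weight=2.0))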