Finite-state transducer based Hungarian LVCSR with explicit modeling of phonological changes

ABSTRACT

This article describes the design and the experimental evaluation of the first Hungarian large vocabulary continuous speech recognition (LVCSR) system. The architecture of the recognition system is based on the recently proposed weighted finite-state transducer (WFST) paradigm. The task domain is the recognition of fluently read sentences selected from a major daily newspaper. Recognition performance is evaluated using both monophone and triphone gender-independent acoustic models. The vocabulary units used in the system are morpheme based in order to provide sufficient coverage of the large number of word forms resulting from affixation and compounding in Hungarian. The language model is a statistical morpheme bigram model. Besides the basic list-style pronunciation dictionary model, we evaluate a novel phonology modeling component that describes the phonological changes prevalent in fluent Hungarian. Thanks to the flexible transducer-based architecture of the system, the phonological component is integrated seamlessly with the basic modules, with no need to modify the decoder itself. The proposed phonological model reduces the error rate by 8.32% relative to the baseline triphone system. The morpheme error rate of the best configuration is 17.74% in a 1200-morpheme task with a test set perplexity of 70.

1. INTRODUCTION

Hungarian is a Finno-Ugric language spoken by about 15 million people, mainly in Hungary and in the neighbouring countries. There are 64 phonemes (14 vowels and 50 consonants) in Hungarian, which can be divided into two groups of short/long pairs (length is a phonemically distinguishing feature for both vowels and consonants).
Similarly to the other members of the Finno-Ugric language family, Hungarian is an agglutinating language, that is, it relies heavily on suffixes. Hungarian uses the Latin alphabet, and the written and spoken forms of words correspond relatively closely. In most cases words are spoken as written; the exception is consonant combinations that would be difficult to pronounce.

Speech research has a long tradition in Hungary [3], and there exist several research and commercial systems for both speech synthesis and automatic speech recognition (ASR). Previous ASR research efforts have been limited, however, to command and control tasks that have a limited vocabulary. Besides the shortage of resources, the main obstacle that delayed the beginning of Hungarian LVCSR research is the size of the vocabulary and the complexity of the morphology. The number of different word forms is in the range of hundreds of millions according to an estimate by the authors of the best Hungarian spell-checking software [7], and accurate modeling of this vocabulary is not easy even with morphological decomposition because the number of inflection classes is very large for historical reasons. The other difficulty from an ASR point of view is the accurate computational representation of pronunciation. It is true that the spelling system is phonemic and that the written and spoken forms are strongly related, but the pronunciation of most words starting or ending with a consonant depends on the adjacent words, because difficult consonant combinations are replaced with simpler ones by a hierarchy of phonological rules.

In our previous work [8, 9] we proposed methods for treating both of these obstacles, but these methods could not be evaluated at that time due to the lack of suitable databases and the lack of an implementation.
In Section 2 of this article we describe the architecture of our new weighted finite-state transducer based recognition system, which was designed to facilitate an efficient implementation of both our phonology and morphology modeling methods. Then we give an overview of the acoustic and language modeling components, including a description of our recently collected speech database and the language model database. The details of our pronunciation and phonology modeling method are explained in Section 4, while the results of the experimental evaluation of the system are described in Section 5. Finally, we conclude our work in Section 6 with a summary and plans for future work.

2. SYSTEM OVERVIEW

The standard knowledge components in a state-of-the-art ASR system are the acoustic model, the pronunciation model and the language model. The usual practice is to represent each of these different types of knowledge in its own specialized data structure and to use dedicated code in the decoder for combining and searching them. This practice has been motivated by the need for very efficient implementations and possibly also by the incremental development of the recognition systems. The price of this highly optimized implementation is, however, the loss of flexibility for adding new knowledge sources to the system.
The reason is that the specialized code becomes increasingly complex, and usually only the original developer of the decoder module is able to add the new components. Even though it has been widely understood for a long while that all the usual knowledge sources (KSs) are just different instantiations of the same basic mathematical data structure, it has only recently been demonstrated [6] that a recognition system using a flat data representation and generic algorithms for all KSs can achieve, with affordable system resources, a performance similar to that of specialized systems.

This weighted finite-state transducer (WFST) based architecture [5, 6] is especially attractive for us because all the phonological and morphological dependencies described in our previous work [8, 9] can be easily converted into a WFST representation. Moreover, we believe that higher-level linguistic dependencies, such as the agreement of the number and person of the subject and the predicate, can also be straightforwardly represented in this framework. Therefore we designed our recognition system from the beginning according to the flat-data WFST paradigm.
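
To illustrate why a uniform transducer representation makes it easy to add a knowledge source, the following minimal Python sketch composes three toy machines with one generic routine and then searches the result with one generic shortest-path routine. It is not the implementation used in this system: the helper names (`make_fst`, `compose`, `best_path`, `linear_acceptor`) are hypothetical, the composition is epsilon-free, weights are costs in the tropical semiring, and the single context-independent arc mapping a surface "b" to an underlying "p" is only a stand-in for a real phonological rule such as voicing assimilation. The word "népdal" is an illustrative choice.

```python
import heapq
from collections import defaultdict

def make_fst(start, finals, arcs):
    """Build a toy WFST. arcs: (src, in_label, out_label, cost, dst)."""
    out = defaultdict(list)
    for src, i, o, w, dst in arcs:
        out[src].append((i, o, w, dst))
    return {"start": start, "finals": dict(finals), "arcs": out}

def linear_acceptor(symbols):
    """Chain that accepts exactly one symbol sequence at zero cost."""
    arcs = [(k, s, s, 0.0, k + 1) for k, s in enumerate(symbols)]
    return make_fst(0, {len(symbols): 0.0}, arcs)

def compose(a, b):
    """Epsilon-free composition; costs add (tropical semiring)."""
    start = (a["start"], b["start"])
    arcs, finals, stack, seen = [], {}, [start], {start}
    while stack:
        q = stack.pop()
        q1, q2 = q
        if q1 in a["finals"] and q2 in b["finals"]:
            finals[q] = a["finals"][q1] + b["finals"][q2]
        for i1, o1, w1, n1 in a["arcs"].get(q1, []):
            for i2, o2, w2, n2 in b["arcs"].get(q2, []):
                if o1 == i2:  # output of a must match input of b
                    nxt = (n1, n2)
                    arcs.append((q, i1, o2, w1 + w2, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
    return make_fst(start, finals, arcs)

def best_path(fst):
    """Dijkstra search (non-negative costs). Returns (cost, outputs) or None.
    None is used internally as a super-final state, so it must not be a state id."""
    heap, done, tick = [(0.0, 0, fst["start"], ())], set(), 0
    while heap:
        cost, _, q, out = heapq.heappop(heap)
        if q is None:
            return cost, out
        if q in done:
            continue
        done.add(q)
        if q in fst["finals"]:
            tick += 1
            heapq.heappush(heap, (cost + fst["finals"][q], tick, None, out))
        for _i, o, w, nxt in fst["arcs"].get(q, []):
            tick += 1  # unique tie-breaker, so states are never compared
            heapq.heappush(heap, (cost + w, tick, nxt, out + (o,)))
    return None

# Toy phonology transducer P: surface phone -> underlying phoneme.
phones = ["n", "é", "p", "b", "d", "a", "l"]
P = make_fst(0, {0: 0.0},
             [(0, s, s, 0.0, 0) for s in phones] + [(0, "b", "p", 1.0, 0)])

# Toy one-entry lexicon: the phonemic form of "népdal".
lexicon = linear_acceptor(["n", "é", "p", "d", "a", "l"])

# Observed surface form with a voiced "b" before "d".
observed = linear_acceptor(["n", "é", "b", "d", "a", "l"])

result = best_path(compose(compose(observed, P), lexicon))
print(result)  # cost 1.0 with output phonemes n é p d a l
```

In this sketch the phonology component is just one more argument to `compose`; neither `compose` nor `best_path` contains anything specific to phonology, which is the property the flat-data paradigm relies on. A production system would additionally need epsilon handling, determinization and minimization, which are omitted here.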