Within-speaker variability of the word error rate for a continuous speech recognition system

David A. van Leeuwen and Herman J. M. Steeneken
Electronic mail: {vanLeeuwen;Steenek}@tm.tno.nl
TNO Human Factors Research Institute.
Postbus 23, 3769 ZG Soesterberg, The Netherlands.

ABSTRACT

The variance of the performance of a continuous speech recognition system subjected to replica utterances of the same sentence spoken by the same speaker has been investigated. In an experiment with three different speech recognition systems in three different languages with two different grammar conditions it is shown that the sentence word error rate has a variance that can be described in terms of binomial statistics. The distribution of the measured variance shows a remarkable correspondence to the parameter-free theoretical distribution. It is therefore concluded that for the word error rate of a continuous speech recognition system binomial statistics apply.

INTRODUCTION

The word error rate (sometimes expressed in its complement, the accuracy) is the most widely used measure of the performance of speech recognition systems. Traditionally, for isolated word recognizers this measure has been one which leaves little argument for interpretation, but for continuous speech recognition systems the situation is more complex. Because of the nature of natural speech the words are connected to a long string. This makes it somewhat difficult to pinpoint the exact location of an error in case of misrecognition and consequently makes it hard to count the number of erroneous words. Evaluating the correctness of the utterance as a whole, measured in the utterance (or sentence) error rate, resolves this problem. However, this measure needs much more speech material before an accurate figure is found, and researchers often use the word error rate because it is more sensitive to small changes in the performance of the speech recognition system.

One of the questions we want to address in this paper is how accurate a measurement of the word error rate is for a continuous speech recognition
system. For a representative evaluation, one generally wants to have a wide coverage of language and, in the case of a speaker-independent system, a wide coverage of speakers. Because both sets are virtually infinite in size, for each evaluation new samples are drawn from the sets of language material (sentences) and speakers. If there are ways to quantify the accuracy of a word error rate measure, and objective ways to calibrate the `difficulty' of the test material [1], a new evaluation can successfully be compared to an earlier one.

EXPERIMENTAL SETUP

In order to study the inherent variability of the performance of a continuous speech recognizer, we performed a test with no variability in speaker and spoken text. This experiment was carried out as an additional test in the project Sqale, which was a project that compared speech recognition in different European languages and for different systems [1, 2]. The variability in speaker and speech content was made zero by having a speaker read out the same sentence several times; we call the individual utterances replicas of the same sentence. (These replicas can in principle be used to measure the within-speaker variability.) The replicas were recorded during a recording session of the evaluation test of Sqale, and were spread among the normal evaluation sentences. The speakers were prepared for the occurrence of replicas, and were requested to read out a replica as if it was the first occurrence in order to make the utterances as much alike as possible. We chose 5 replicas of one sentence for each recorded speaker; more replicas might have stretched the subject's acceptance limits too far, and we did not want the reading style of the other (evaluation test) utterances to be influenced by this test.

Table. The number of sentences available for each language. Each speaker, having its own sentence, uttered 5 replicas. The number of speech recognition systems available per language is also indicated, as well as the resulting number of measurement points.

Language       American    British    German
               English     English
sentences          3           7         10
systems            3           3          2
grammars           2           2          2
data points       18          42         40

The replica utterances were recorded in three different languages, in amounts according to the ta-