On the evaluation of speech recognizers and data bases using a reference system

The most straightforward way to compare the performance of speech recognizers is to test them on an identical data base. Agreement on a single data base to serve as a standard could, however, be difficult to reach. An alternative is to agree on a set of algorithms (a reference system) and to compare each system to this reference. The reference can also be used to quantify the difficulty of the test sets. Differences in performance between the system under test and the reference will be meaningful to the speech community if the reference system is made widely available. The complete specifications (in FORTRAN) of a set of speech analysis and pattern discrimination algorithms are proposed here for this purpose. The proposed recognizer uses dynamic time warping to optimize the match between unknown and reference utterances. Every utterance is coded as a sequence of vectors of cepstral coefficients, obtained from a short-time power spectrum expressed on a mel frequency scale.
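The matching step described above can be illustrated with a minimal sketch in Python. This is not the FORTRAN specification proposed here. The Euclidean local distance, the three-predecessor step pattern, and the names `euclidean`, `dtw_distance`, and `recognize` are all assumptions chosen for illustration. The abstract does not state the local constraints, the slope weights, or the distance measure actually used.

```python
import math


def euclidean(a, b):
    """Local distance between two cepstral vectors (assumed; not from the paper)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(unknown, reference, local_distance=euclidean):
    """Accumulated distance of the best time alignment of two vector sequences.

    Assumed step pattern: each cell may be reached from its left, lower,
    or diagonal neighbour, with no slope weighting.
    """
    n, m = len(unknown), len(reference)
    if n == 0 or m == 0:
        raise ValueError("both utterances must contain at least one frame")

    # cost[i][j] is the best accumulated distance aligning the first i frames
    # of the unknown with the first j frames of the reference.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = local_distance(unknown[i - 1], reference[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # unknown frame advances alone
                cost[i][j - 1],      # reference frame advances alone
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


def recognize(unknown, templates):
    """Return the label of the reference utterance closest to the unknown.

    `templates` maps a label to one reference utterance (a list of vectors).
    """
    return min(templates, key=lambda label: dtw_distance(unknown, templates[label]))
```

In this sketch each utterance is a list of equal-length vectors, standing in for the sequences of cepstral coefficients. A call such as `recognize(unknown, {"one": ref_one, "two": ref_two})` returns the label whose reference utterance gives the smallest accumulated distance, where `ref_one` and `ref_two` are hypothetical stored templates.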