A new metric for stochastic language model evaluation

Although perplexity correlates well with word error rate (WER) for simple n-gram frameworks such as the Wall Street Journal task, it has been reported that perplexity correlates poorly with WER when more complicated language models (LMs) are used. In this paper, a global measure for language model evaluation is proposed that achieves higher correlation with word accuracy. The metric is based on the difference between the LM score of a word in the evaluation text and the score of the word that maximizes the LM score in the same context. Two experiments were carried out to investigate the correlation between word accuracy and the proposed measure. In the first experiment, LMs were created using n-gram adaptation by n-gram count mixture; 47 LMs were produced by varying the mixture weight and the vocabulary cut-off threshold. The correlation between perplexity and word accuracy was very poor (correlation coefficient -0.36), whereas the proposed metric gave a much higher correlation (correlation coefficient 0.82). In the second experiment, a simple mixture trigram model was employed to recognize Switchboard task data. The highest correlation between word accuracy and the proposed measure was 0.81, much higher than the 0.59 correlation between perplexity and accuracy.
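The core idea of the metric can be sketched in a few lines: for each word in the evaluation text, compare the LM's log-score for that word against the best log-score the LM assigns to any word in the same context, and average the gaps. The sketch below illustrates this with a toy bigram table; the function name `score_gap`, the table, and the averaging are illustrative assumptions, not the paper's exact formulation.

```python
import math

# Toy bigram model: P(word | previous word). This table is a made-up
# example for illustration only.
BIGRAM = {
    "<s>": {"the": 0.6, "a": 0.3, "cat": 0.1},
    "the": {"cat": 0.5, "dog": 0.4, "a": 0.1},
    "a":   {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "ran": 0.3},
}

def score_gap(text):
    """Average difference between the log-probability of each observed
    word and the maximum log-probability available at its context.
    A value of 0.0 means every word was the LM's top-scoring choice;
    more negative values mean the text often diverged from it."""
    prev, gaps = "<s>", []
    for word in text:
        dist = BIGRAM[prev]
        best = max(dist.values())
        gaps.append(math.log(dist[word]) - math.log(best))
        prev = word
    return sum(gaps) / len(gaps)

# Every word here is the LM's top choice, so the gap is zero.
print(score_gap(["the", "cat", "sat"]))  # → 0.0
```

Unlike perplexity, which depends only on the probability assigned to the observed words, this gap also reflects how the observed words rank against the LM's preferred alternatives, which is what a recognizer's search actually competes over.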