A study of unsupervised clustering techniques for language modeling

There has been recent interest in clustering text data to build topic-specific language models for large vocabulary speech recognition. In this paper, we study various unsupervised clustering algorithms on several corpora. First, we compare the clustering methods using quality metrics such as entropy and purity. Of the techniques studied, two-phase bisecting K-means achieves good clustering quality while running relatively fast. We then perform speech recognition experiments on English and Arabic systems using the automatically derived topic-based language models. We obtain modest word error rate improvements, comparable to those in previously published studies. Finally, we present a careful analysis of the correlation between word error rate and the distribution of misrecognized words, including an information gain metric.
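
As an illustration of the entropy and purity metrics mentioned above, the following is a minimal sketch under stated assumptions, not the implementation used in the paper. It assumes each document has a cluster assignment and a reference topic label, and it uses one common formulation: size-weighted averages over clusters, with entropy measured in bits and left unnormalized. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cluster_counts(cluster_ids, topic_labels):
    """Count how many documents of each reference topic fall in each cluster."""
    counts = defaultdict(Counter)
    for cluster, topic in zip(cluster_ids, topic_labels):
        counts[cluster][topic] += 1
    return counts


def purity(cluster_ids, topic_labels):
    """Fraction of documents that belong to the majority topic of their cluster."""
    counts = cluster_counts(cluster_ids, topic_labels)
    n = len(cluster_ids)
    return sum(max(topics.values()) for topics in counts.values()) / n


def entropy(cluster_ids, topic_labels):
    """Size-weighted average of the per-cluster topic entropy, in bits.

    Assumption: no normalization by the log of the number of topics.
    """
    counts = cluster_counts(cluster_ids, topic_labels)
    n = len(cluster_ids)
    total = 0.0
    for topics in counts.values():
        n_j = sum(topics.values())
        h_j = -sum((m / n_j) * math.log2(m / n_j) for m in topics.values())
        total += (n_j / n) * h_j
    return total


if __name__ == "__main__":
    # Toy data (hypothetical): six documents, two clusters, two topics.
    clusters = [0, 0, 0, 1, 1, 1]
    topics = ["a", "a", "b", "b", "b", "b"]
    print("purity :", purity(clusters, topics))
    print("entropy:", entropy(clusters, topics))
```

Under this formulation, higher purity and lower entropy both indicate clusters that align more closely with the reference topics; a clustering that reproduces the reference labels exactly has purity 1 and entropy 0.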