Paper information - Unsupervised clustering of spontaneous speech documents

Unsupervised clustering of spontaneous speech documents

This paper presents an unsupervised method for clustering spontaneous speech documents. The approach uses a hierarchical algorithm to automatically determine the number of clusters and to produce a starting model for a subsequent iterative algorithm. We evaluated this method on the Switchboard corpus and compared it to a set of supervised methods and other unsupervised methods. The results show that our method significantly outperforms the other approaches.

Jordi Turmo | Edgar González

[1] Sebastian Thrun, et al. Text Classification from Labeled and Unlabeled Documents using EM, 2000, Machine Learning.

[2] G. W. Milligan, et al. An examination of procedures for determining the number of clusters in a data set, 1985.

[3] John J. Godfrey, et al. SWITCHBOARD: telephone speech corpus for research and development, 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[4] André Hardy, et al. An examination of procedures for determining the number of clusters in a data set, 1994.

[5] George Karypis, et al. Evaluation of hierarchical clustering algorithms for document datasets, 2002, CIKM '02.

[6] Mihai Surdeanu, et al. A hybrid unsupervised approach for document clustering, 2005, KDD '05.

[7] Alessandro Vinciarelli, et al. Effect of Recognition Errors on Text Clustering, 2004.

[8] T. Caliński, et al. A dendrite method for cluster analysis, 1974.

[9] Xin Liu, et al. Document clustering based on non-negative matrix factorization, 2003, SIGIR.

[10] D. Rubin, et al. Maximum likelihood from incomplete data via the EM algorithm plus discussions on the paper, 1977.

[11] Beth A. Carlson. Unsupervised topic clustering of switchboard speech messages, 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[12] George Karypis, et al. Empirical and Theoretical Comparisons of Selected Criterion Functions for Document Clustering, 2004, Machine Learning.