Unsupervised clustering of spontaneous speech documents

This paper presents an unsupervised method for clustering spontaneous speech documents. The approach uses a hierarchical algorithm to determine the number of clusters automatically and to provide an initial model for a subsequent iterative algorithm. We evaluated the method on the Switchboard corpus and compared it with a set of supervised methods and other unsupervised methods. The results show that our method significantly outperforms the other approaches.
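As an illustration only, the two-stage idea described above can be sketched as follows. The abstract does not specify the linkage, the criterion for choosing the number of clusters, or the iterative algorithm, so this sketch assumes centroid-linkage agglomeration, the Calinski-Harabasz index for selecting the number of clusters, and k-means as the iterative refinement. All function names and the toy 2-D vectors (standing in for document feature vectors) are hypothetical.

```python
import math

def centroid(points):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def calinski_harabasz(clusters):
    """Between-cluster over within-cluster dispersion (higher is better)."""
    all_points = [p for c in clusters for p in c]
    n, k = len(all_points), len(clusters)
    overall = centroid(all_points)
    between = sum(len(c) * sqdist(centroid(c), overall) for c in clusters)
    within = sum(sqdist(p, centroid(c)) for c in clusters for p in c)
    if within == 0:
        return math.inf
    return (between / (k - 1)) / (within / (n - k))

def agglomerate(points):
    """Yield the partition after each merge of a centroid-linkage hierarchy."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        # Find the pair of clusters whose centroids are closest.
        i, j = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ab: sqdist(centroid(clusters[ab[0]]),
                                  centroid(clusters[ab[1]])),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
        yield clusters

def choose_partition(points):
    """Stage 1: pick the hierarchy level with the best criterion value."""
    candidates = [c for c in agglomerate(points) if 2 <= len(c) < len(points)]
    return max(candidates, key=calinski_harabasz)

def refine(points, clusters, iterations=10):
    """Stage 2: k-means started from the centroids of the chosen partition."""
    centers = [centroid(c) for c in clusters]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: sqdist(p, centers[i]))
            groups[nearest].append(p)
        # Keep the previous center if a group becomes empty.
        centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
    labels = [min(range(len(centers)), key=lambda i: sqdist(p, centers[i]))
              for p in points]
    return centers, labels

# Toy data: three well-separated groups of three points each.
points = [
    [0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
    [5.0, 5.1], [5.1, 5.0], [5.0, 5.0],
    [10.0, 0.0], [10.1, 0.1], [10.0, 0.1],
]
partition = choose_partition(points)
centers, labels = refine(points, partition)
print(len(centers), labels)
```

The hierarchical stage supplies both the number of clusters and the starting centers, so the iterative stage needs no random initialization. A real system would operate on high-dimensional term vectors derived from speech transcripts rather than on 2-D points.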
