Multilingual Spectral Clustering Using Document Similarity Propagation

We present a novel approach to multilingual document clustering that uses only comparable corpora to achieve cross-lingual semantic interoperability. The method models document collections as a weighted graph, and supervisory information is given as sets of must-link constraints between documents in different languages. Recursive k-nearest-neighbor similarity propagation is used to exploit this prior knowledge and merge the two language spaces. A spectral method is then applied to find the best cuts of the graph. Experimental results show that, with limited supervisory information, our method achieves promising clustering results. Furthermore, since the method requires no language-dependent information, it can be applied to languages with different writing systems.
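
The pipeline outlined above (weighted graph, must-link constraints, neighbor-based propagation, spectral cut) can be illustrated with a minimal sketch. It is not the paper's algorithm: the toy similarity values, the product-based propagation rule, the single propagation round (the abstract describes a recursive scheme), and the names `propagate_must_links` and `spectral_bipartition` are all assumptions made for illustration.

```python
import numpy as np


def propagate_must_links(W, lang, must_links, k=1):
    """Assumed rule: copy each must-link to the k nearest same-language
    neighbours of its endpoints, scaled by the neighbours' similarities."""
    W = W.copy()
    n = len(W)

    def neighbours(i):
        # k most similar documents written in the same language as i
        same = [j for j in range(n) if j != i and lang[j] == lang[i]]
        same.sort(key=lambda j: -W[i, j])
        return same[:k]

    for i, j in must_links:
        W[i, j] = W[j, i] = 1.0  # must-linked documents are fully similar

    for i, j in must_links:
        for a in [i] + neighbours(i):
            for b in [j] + neighbours(j):
                sim_a = 1.0 if a == i else W[i, a]
                sim_b = 1.0 if b == j else W[j, b]
                score = sim_a * sim_b
                W[a, b] = W[b, a] = max(W[a, b], score)
    return W


def spectral_bipartition(W):
    """Two-way cut from the second eigenvector of the normalized Laplacian."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L_sym)   # eigenvalues in ascending order
    fiedler = d_inv_sqrt * eigvecs[:, 1]  # second-smallest eigenvector
    return (fiedler > 0).astype(int)


# Toy data (hypothetical): documents 0-3 are in language A, 4-7 in language B.
# Documents {0, 1, 4, 5} share one topic and {2, 3, 6, 7} share another.
lang = [0, 0, 0, 0, 1, 1, 1, 1]
topic = [0, 0, 1, 1, 0, 0, 1, 1]
W = np.zeros((8, 8))
for p in range(8):
    for q in range(8):
        if p != q and lang[p] == lang[q]:
            # Only same-language similarities are known in advance.
            W[p, q] = 0.9 if topic[p] == topic[q] else 0.1

must_links = [(0, 4), (2, 6)]  # one cross-lingual pair per topic
W_merged = propagate_must_links(W, lang, must_links, k=1)
labels = spectral_bipartition(W_merged)
print(labels)
```

In this toy setup the two language blocks start out disconnected, and two must-link pairs plus one propagation step are enough for the spectral cut to group documents by topic rather than by language. Which cluster receives label 0 or 1 depends on the sign of the eigenvector and is arbitrary.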
