Multilingual document clusters discovery

The Cross-Language Information Retrieval community has produced search engines that operate over multilingual corpora, as well as multilingual text categorization systems. In this paper, we focus on the multilingual cluster discovery problem, whose aim is to extract topic-related multilingual document clusters from a multilingual document collection in an unsupervised way. Our approach is based on a linguistic analysis of the documents that allows us to identify relevant features for a vector representation of the documents, each language being associated with a different vector space. We propose a cross-lingual similarity measure for the documents, using bilingual dictionaries. A Shared Nearest Neighbor clustering algorithm is then used to build the clusters. We present an evaluation framework for this task, analyze and discuss the results we obtained, and propose directions for future work.
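The pipeline described above (per-language vectors, dictionary-based cross-lingual similarity, shared-nearest-neighbor clustering) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy documents and dictionaries, the even split of a term's weight across its translations, the cosine measure, and the Jarvis-Patrick-style linking rule with the hypothetical parameters `k` and `min_shared`.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def translate(vec, dictionary):
    """Project a vector into another language's space (assumed scheme:
    split each weight evenly over the translations, drop unknown terms)."""
    out = defaultdict(float)
    for term, w in vec.items():
        targets = dictionary.get(term, [])
        for t in targets:
            out[t] += w / len(targets)
    return dict(out)

def similarity(doc_a, doc_b, dictionaries):
    """Same language: plain cosine. Otherwise translate doc_a first.
    Note: this is not guaranteed to be symmetric across languages."""
    (lang_a, vec_a), (lang_b, vec_b) = doc_a, doc_b
    if lang_a != lang_b:
        vec_a = translate(vec_a, dictionaries.get((lang_a, lang_b), {}))
    return cosine(vec_a, vec_b)

def snn_clusters(docs, dictionaries, k=2, min_shared=1):
    """Link two documents when each is in the other's k-nearest-neighbor
    list and the lists share at least min_shared members; clusters are
    the connected components of the resulting graph."""
    n = len(docs)
    neighbours = []
    for i in range(n):
        sims = {j: similarity(docs[i], docs[j], dictionaries)
                for j in range(n) if j != i}
        ranked = sorted((j for j, s in sims.items() if s > 0),
                        key=sims.get, reverse=True)
        neighbours.append(set(ranked[:k]))

    parent = list(range(n))  # union-find forest over document indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            mutual = i in neighbours[j] and j in neighbours[i]
            if mutual and len(neighbours[i] & neighbours[j]) >= min_shared:
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return sorted(groups.values())

# Toy data (hypothetical): (language, term-weight vector) pairs.
docs = [
    ("en", {"election": 1.0, "vote": 1.0}),
    ("fr", {"élection": 1.0, "vote": 1.0}),
    ("en", {"election": 1.0, "president": 1.0}),
    ("en", {"football": 1.0, "goal": 1.0}),
    ("fr", {"football": 1.0, "but": 1.0}),
    ("en", {"football": 1.0, "match": 1.0}),
]
dictionaries = {
    ("en", "fr"): {"election": ["élection"], "vote": ["vote", "scrutin"],
                   "president": ["président"], "football": ["football"],
                   "goal": ["but"], "match": ["match"]},
    ("fr", "en"): {"élection": ["election"], "vote": ["vote"],
                   "scrutin": ["vote"], "président": ["president"],
                   "football": ["football"], "but": ["goal"],
                   "match": ["match"]},
}

if __name__ == "__main__":
    print(snn_clusters(docs, dictionaries))
```

On this toy collection, the sketch groups the three election documents and the three football documents into two clusters, each mixing English and French. A real system would replace the raw term weights with features produced by linguistic analysis and would need a principled way to handle ambiguous or missing dictionary entries.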
