Multilingual news clustering: Feature translation vs. identification of cognate named entities

In this paper we evaluate how different document representations affect the results of multilingual news clustering. We aim to determine whether named entities alone are a good source of knowledge for this task. We compare two approaches: one based on feature translation, and another based on cognate identification. Our main contribution is to use only some categories of cognate named entities as document representation features for multilingual news clustering, without the need for translation resources. The results show that using cognate named entities as the only type of feature to represent news leads to good multilingual clustering performance, comparable to that obtained with the feature translation approach.
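To make the cognate-based approach concrete, the following is a minimal sketch of how named entities from two news items in different languages might be matched without translation resources. The abstract does not say how cognates are identified, so the normalized edit-distance measure, the 0.7 threshold, and the function names (`levenshtein`, `similarity`, `are_cognates`, `shared_cognates`) are all assumptions made for illustration, not the authors' method.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1], normalized by the longer string."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def are_cognates(a: str, b: str, threshold: float = 0.7) -> bool:
    """Treat two entity strings as cognates if they are similar enough.

    The threshold is a hypothetical value chosen for this sketch.
    """
    return similarity(a, b) >= threshold


def shared_cognates(entities_a: List[str], entities_b: List[str],
                    threshold: float = 0.7) -> List[Tuple[str, str]]:
    """Pairs of entities, one from each document, judged to be cognates."""
    return [(x, y) for x in entities_a for y in entities_b
            if are_cognates(x, y, threshold)]


if __name__ == "__main__":
    # Hypothetical entity lists from a Spanish and an English news item.
    spanish = ["Irak", "Bruselas", "Madrid"]
    english = ["Iraq", "Brussels", "Paris"]
    print(shared_cognates(spanish, english))
```

Under these assumptions, the number of cognate pairs two documents share could serve as evidence that they belong in the same cluster. A real system would also need named entity recognition and a choice of entity categories, which this sketch leaves out.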
