NESM: a named entity based proximity measure for multilingual news clustering

Measuring the similarity between documents is an essential task in document clustering. This paper presents a new metric based on the number and category of the Named Entities shared between news documents. To evaluate the quality of the proposed measure in multilingual news clustering, it was tested with three different feature-weighting functions and compared against two standard similarity measures. Results on three collections of comparable news written in English and Spanish indicate that the new metric in some cases outperforms standard similarity measures such as cosine similarity and the correlation coefficient.
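
To make the comparison concrete, the following is a minimal sketch of the two standard baselines named above (cosine similarity and the correlation coefficient) alongside a hypothetical category-weighted overlap score on shared named entities. The function `shared_entity_score`, the category labels, and the weights are illustrative assumptions, not the NESM formula, which the abstract does not give. The sketch also assumes entities from both languages have already been normalized to a common form.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def correlation_coefficient(a, b):
    """Pearson correlation over the union of features of two sparse vectors."""
    keys = sorted(a.keys() | b.keys())
    n = len(keys)
    if n < 2:
        return 0.0
    xs = [a.get(k, 0.0) for k in keys]
    ys = [b.get(k, 0.0) for k in keys]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def shared_entity_score(doc_a, doc_b, category_weights):
    """Hypothetical weighted-overlap score, NOT the paper's NESM definition.

    Each document is a set of (entity, category) pairs. Shared entities
    contribute their category weight; the total is normalized by the
    weighted size of the union (a weighted Jaccard coefficient).
    """
    union = doc_a | doc_b
    if not union:
        return 0.0

    def weight(entity):
        # Categories missing from the weight table default to 1.0 (assumption).
        return category_weights.get(entity[1], 1.0)

    shared_weight = sum(weight(e) for e in doc_a & doc_b)
    return shared_weight / sum(weight(e) for e in union)


# Illustrative documents: entity strings and categories are made up.
doc_en = {("madrid", "LOC"), ("zapatero", "PER")}
doc_es = {("madrid", "LOC"), ("onu", "ORG")}
weights = {"PER": 2.0, "LOC": 1.0, "ORG": 1.0}
score = shared_entity_score(doc_en, doc_es, weights)
```

Unlike the two vector-based baselines, a score of this kind depends only on which entities co-occur and how their categories are weighted, so it does not require a shared vocabulary across languages beyond the entities themselves.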
