Cross-Language High Similarity Search Using a Conceptual Thesaurus

This work addresses cross-language high similarity and near-duplicate search, where, for a given document, a highly similar one must be identified in a large collection of documents written in another language. We propose a concept-based similarity model for this problem that is light in both computation and memory. We evaluate the model on three corpora of different nature and on two language pairs, English-German and English-Spanish, using the Eurovoc conceptual thesaurus. We compare our model with two state-of-the-art models and find that, although the proposed model is very generic, it produces competitive results and is notably stable and consistent across the corpora.
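To make the idea of concept-based matching concrete, the following is a minimal sketch and not the model evaluated in this work. It assumes that each document is mapped to counts of language-independent concept identifiers through a thesaurus lookup and that documents are compared by cosine similarity. The thesaurus entries, the identifiers `c1` to `c3`, and the function names `concept_vector`, `cosine`, and `most_similar` are all hypothetical; whitespace tokenization stands in for real preprocessing.

```python
import math
from collections import Counter

# Hypothetical toy thesaurus: (language, term) -> language-independent concept ID.
# These entries are invented for illustration and are not taken from Eurovoc.
TOY_THESAURUS = {
    ("en", "fishing"): "c1",
    ("de", "fischerei"): "c1",
    ("en", "agriculture"): "c2",
    ("de", "landwirtschaft"): "c2",
    ("en", "trade"): "c3",
    ("de", "handel"): "c3",
}


def concept_vector(text, lang):
    """Map a text to a bag of concept IDs, ignoring terms not in the thesaurus."""
    tokens = text.lower().split()  # assumption: naive whitespace tokenization
    return Counter(
        TOY_THESAURUS[(lang, tok)] for tok in tokens if (lang, tok) in TOY_THESAURUS
    )


def cosine(u, v):
    """Cosine similarity between two sparse concept-count vectors."""
    dot = sum(count * v[concept] for concept, count in u.items() if concept in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query, query_lang, collection, collection_lang):
    """Return the index of the collection document closest to the query."""
    query_vec = concept_vector(query, query_lang)
    scores = [
        cosine(query_vec, concept_vector(doc, collection_lang)) for doc in collection
    ]
    return max(range(len(collection)), key=scores.__getitem__)


if __name__ == "__main__":
    german_docs = ["landwirtschaft bericht", "fischerei handel abkommen"]
    print(most_similar("fishing and trade policy", "en", german_docs, "de"))
```

Because both languages are projected into the same concept space, no translation step is needed at query time, and each document is stored as a short sparse vector. That is one plausible reading of why a model of this kind can be light in computation and memory.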
