Paper information - Cross-Language High Similarity Search Using a Conceptual Thesaurus

This work addresses cross-language high similarity and near-duplicate search: given a document, a highly similar one is to be identified in a large cross-language collection of documents. We propose a concept-based similarity model for the problem that is very light in computation and memory. We evaluate the model on three corpora of different nature and two language pairs, English-German and English-Spanish, using the Eurovoc conceptual thesaurus. We compare our model with two state-of-the-art models and find that, although the proposed model is very generic, it produces competitive results and is significantly stable and consistent across the corpora.
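The general idea of a concept-based cross-language similarity can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes each document is mapped to a bag of language-independent concept identifiers through per-language term lists, and that two documents are then compared with cosine similarity. The thesaurus entries, the concept identifiers (`c1`, `c2`, `c3`), and all function names are hypothetical and are not taken from Eurovoc or from the paper.

```python
import math
import re
from collections import Counter

# Hypothetical toy thesaurus: concept id -> terms per language.
# These entries are invented for illustration; they are not Eurovoc data.
TOY_THESAURUS = {
    "c1": {"en": ["fishery"], "de": ["fischerei"]},
    "c2": {"en": ["trade"], "de": ["handel"]},
    "c3": {"en": ["energy"], "de": ["energie"]},
}


def build_term_index(thesaurus, lang):
    """Map each lowercase term of one language to its concept id."""
    index = {}
    for concept_id, labels in thesaurus.items():
        for term in labels.get(lang, []):
            index[term] = concept_id
    return index


def concept_vector(text, term_index):
    """Count how often each concept is triggered by a token of the text."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(term_index[t] for t in tokens if t in term_index)


def cosine(u, v):
    """Cosine similarity between two sparse concept-count vectors."""
    dot = sum(u[c] * v[c] for c in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


en_index = build_term_index(TOY_THESAURUS, "en")
de_index = build_term_index(TOY_THESAURUS, "de")

en_doc = concept_vector("Trade and fishery policy: trade rules", en_index)
de_similar = concept_vector("Handel und Fischerei: Regeln für den Handel", de_index)
de_unrelated = concept_vector("Energie aus Wind", de_index)

sim_similar = cosine(en_doc, de_similar)
sim_unrelated = cosine(en_doc, de_unrelated)
```

Because both documents are reduced to the same concept space, no translation step is needed at comparison time, and each document is stored as a short sparse vector, which is one plausible reading of why such a model can be light in computation and memory.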

Alberto Barrón-Cedeño | Parth Gupta | Paolo Rosso
