Paper information - Cross-lingual Similarity Calculation for Plagiarism Detection and More - Tools and Resources


Agenda
• EC-Joint Research Centre (JRC) – Who we are
• Monolingual plagiarism detection (PD) work at the JRC
• Cross-lingual similarity calculation at the JRC
  • Named entity (NE) matching across languages
  • Linking related news items across languages
  • Identifying translations of documents
• JRC's multilingual tools and resources
• Summary

JRC - Who we are
• European Commission (scientific-technical arm of public administration)
• Non-commercial
• Multidisciplinary / multilingual
• Main product: Europe Media Monitor (EMM)
  • ~ 150,000 online news articles / day in ~ 50 languages
  • ~ 3600 sources (worldwide, with focus on Europe)
  • In-depth analysis in 20 languages (NewsExplorer)
  • 24/7, updated every 10 minutes
  • Freely accessible via

Monolingual PD work
• N-gram overlap between pairs of documents
  • Karp-Rabin algorithm, using word 5-grams
    • to weed out duplicates in the IAEA document database (ca. 350K documents)
    • to find news article near-duplicates in EMM (applied to all news clusters)
• Method: Search for the longest (in characters) word 6-grams of each document in the EC database and on the web (avoiding strings from the document template)
• If target documents pass the similarity threshold:
  • Full-text comparison of matching documents to detect significant matches
  • Visualise document overlap and check manually.
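The word n-gram overlap idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the JRC implementation: the tokeniser, the hash parameters (`base`, `mod`), the use of CRC32 to turn words into integers, and the choice of normalising by the smaller document are all assumptions made here. The rolling update is the Karp-Rabin step: each new 5-gram hash is derived from the previous one in constant time.

```python
import re
import zlib


def word_ngram_hashes(text, n=5, base=1_000_003, mod=(1 << 61) - 1):
    """Return the set of Karp-Rabin rolling hashes of all word n-grams in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return set()
    # Assumption: map each word to an integer with CRC32 (deterministic across runs).
    ids = [zlib.crc32(w.encode("utf-8")) for w in words]
    high = pow(base, n - 1, mod)  # weight of the word that leaves the window
    h = 0
    for x in ids[:n]:  # hash of the first window (Horner's scheme)
        h = (h * base + x) % mod
    hashes = {h}
    for i in range(n, len(ids)):
        # Drop the oldest word, shift the window, append the new word.
        h = ((h - ids[i - n] * high) * base + ids[i]) % mod
        hashes.add(h)
    return hashes


def ngram_overlap(doc_a, doc_b, n=5):
    """Share of the smaller document's word n-grams that also occur in the other."""
    a, b = word_ngram_hashes(doc_a, n), word_ngram_hashes(doc_b, n)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog near the river bank"
    b = "yesterday the quick brown fox jumps over the lazy dog again"
    print(round(ngram_overlap(a, b), 3))
```

Because hashes can collide, a pair that passes a threshold on this score would still need the full-text comparison described above before being treated as a real match.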
Multilingual NER
Merging name variants
20% + 80%
Condition:
• For all newly found name forms, detect whether they are a variant of an existing NE:
  • Transliteration;
  • Normalisation, using ~30 handwritten rules and removing vowels;
  • Calculate similarity (threshold: 94%).
• Below threshold → new entity
• For frequent or highly visible names, manually launch a Wikipedia mining process.
  • Check for each variant of a name whether there is a Wikipedia entry.
• New name variants, in all scripts, will be recognised in new EMM …
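The variant-merging steps above (transliterate, normalise, remove vowels, compare against a 94% threshold) might look roughly like the following Python sketch. Several parts are stand-ins rather than the JRC method: diacritic stripping via `unicodedata` replaces real transliteration, the three entries in `HYPOTHETICAL_RULES` are invented placeholders for the ~30 handwritten rules, and `difflib.SequenceMatcher` is used because the slides do not name the similarity measure.

```python
import re
import unicodedata
from difflib import SequenceMatcher

# Hypothetical normalisation rules; the real ~30 handwritten rules are not given.
HYPOTHETICAL_RULES = [
    (r"ph", "f"),
    (r"w", "v"),
    (r"(.)\1+", r"\1"),  # collapse repeated characters
]


def strip_diacritics(name):
    """Crude stand-in for transliteration: drop combining marks (Á -> A)."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def normalise(name):
    """Lowercase, apply the placeholder rules, then remove vowels."""
    s = strip_diacritics(name).lower()
    for pattern, replacement in HYPOTHETICAL_RULES:
        s = re.sub(pattern, replacement, s)
    s = re.sub(r"[aeiouy]", "", s)
    return re.sub(r"\s+", " ", s).strip()


def is_variant(new_name, known_name, threshold=0.94):
    """True if the normalised forms are at least `threshold` similar."""
    ratio = SequenceMatcher(None, normalise(new_name), normalise(known_name)).ratio()
    return ratio >= threshold


def match_entity(new_name, known_names, threshold=0.94):
    """Return the first known name the new form is a variant of, else None (new entity)."""
    for known in known_names:
        if is_variant(new_name, known, threshold):
            return known
    return None
```

With these placeholder rules, `match_entity("Wladimir Putin", ["Angela Merkel", "Vladimir Putin"])` returns `"Vladimir Putin"`, while a name that stays below the threshold for every known entity yields `None` and would be registered as a new entity.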

Ralf Steinberger | R. Steinberger
