Cross-lingual Similarity Calculation for Plagiarism Detection and More - Tools and Resources

Agenda
• EC-Joint Research Centre (JRC) – Who we are
• Monolingual plagiarism detection (PD) work at the JRC
• Cross-lingual similarity calculation at the JRC
• Named entity (NE) matching across languages
• Linking related news items across languages
• Identifying translations of documents
• JRC's multilingual tools and resources
• Summary

JRC – Who we are
• European Commission (scientific-technical arm of public administration)
• Non-commercial
• Multidisciplinary / multilingual
• Main product: Europe Media Monitor (EMM)
• ~150,000 online news articles / day in ~50 languages
• ~3,600 sources (worldwide, with focus on Europe)
• In-depth analysis in 20 languages (NewsExplorer)
• 24/7, updated every 10 minutes
• Freely accessible via

Monolingual PD work
• N-gram overlap between pairs of documents
• Karp-Rabin algorithm, using word 5-grams
  • to weed out duplicates in the IAEA document database (ca. 350K documents)
  • to find news article near-duplicates in EMM (applied to all news clusters)
• Method: Search for the longest (in chars) word 6-grams of each document in the EC database and on the web (avoiding strings from the document template)
• If target documents pass the similarity threshold:
  • Full-text comparison of matching documents to detect significant matches
  • Visualise document overlap and manually check.
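
The n-gram overlap step can be illustrated with a minimal sketch of Karp-Rabin style rolling hashes over word 5-grams. This is not the JRC implementation: the function names, the hash constants, the CRC32 word codes and the containment score (shared n-grams divided by the smaller document's n-gram count) are all assumptions made for illustration.

```python
import re
import zlib

BASE = 1_000_003       # arbitrary multiplier (assumption, not from the slides)
MOD = (1 << 61) - 1    # large prime modulus (assumption, not from the slides)

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def ngram_hashes(words, n=5):
    """Return the set of rolling hashes of all word n-grams in `words`."""
    # Map each word to an integer code; CRC32 is a stand-in choice.
    codes = [zlib.crc32(w.encode("utf-8")) for w in words]
    if len(codes) < n:
        return set()
    high = pow(BASE, n - 1, MOD)  # weight of the word leaving the window
    h = 0
    for c in codes[:n]:
        h = (h * BASE + c) % MOD
    hashes = {h}
    for i in range(n, len(codes)):
        # Drop the oldest word, shift the window, append the newest word.
        h = ((h - codes[i - n] * high) * BASE + codes[i]) % MOD
        hashes.add(h)
    return hashes

def overlap(doc_a, doc_b, n=5):
    """Share of word n-grams of the smaller document found in the other."""
    a = ngram_hashes(tokenize(doc_a), n)
    b = ngram_hashes(tokenize(doc_b), n)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))
```

Document pairs whose score passes a chosen threshold would then go on to the full-text comparison and manual check listed above. Because hashes can collide, a matching hash is only a candidate match until the underlying word sequences are compared.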

Multilingual NER

Merging name variants
20% + 80%
Condition:
• For all newly found name forms, detect whether they are a variant of an existing NE:
  • Transliteration;
  • Normalisation, using ~30 handwritten rules and removing vowels;
  • Calculate similarity (threshold: 94%).
• Below threshold → new entity
• For frequent or highly visible names, manually launch a Wikipedia mining process.
• Check for each variant of a name whether there is a Wikipedia entry.
• New name variants, in all scripts, will be recognised in new EMM …
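
The variant check (transliterate, normalise, compare against a 94% threshold) can be sketched as follows. Everything specific here is an assumption: diacritic stripping stands in for real cross-script transliteration, the two rewrite rules are hypothetical examples and not the ~30 handwritten JRC rules, and `difflib.SequenceMatcher` over the vowel-stripped forms stands in for whatever similarity measure the JRC system uses.

```python
import re
import unicodedata
from difflib import SequenceMatcher

# Hypothetical normalisation rules, for illustration only.
RULES = [
    (r"ph", "f"),        # e.g. "Stephan" ~ "Stefan"
    (r"(.)\1+", r"\1"),  # collapse doubled characters
]

def normalise(name):
    """Strip diacritics, lowercase, apply the rules, then remove vowels."""
    decomposed = unicodedata.normalize("NFKD", name)
    s = "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()
    for pattern, repl in RULES:
        s = re.sub(pattern, repl, s)
    s = re.sub(r"[aeiou]", "", s)
    return re.sub(r"\s+", " ", s).strip()

def similarity(a, b):
    """Similarity in [0, 1] between the normalised forms of two names."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

def is_variant(a, b, threshold=0.94):
    """True if the two name forms are similar enough to be merged."""
    return similarity(a, b) >= threshold

def match_entity(new_name, known_names, threshold=0.94):
    """Return the best-matching known name, or None (treat as a new entity)."""
    best = max(known_names, key=lambda k: similarity(new_name, k), default=None)
    if best is not None and similarity(new_name, best) >= threshold:
        return best
    return None
```

With these assumptions, `match_entity("Mohammed", ["Angela Merkel", "Muhammad"])` returns `"Muhammad"`, because both forms normalise to the same consonant skeleton. A name that matches nothing above the threshold returns `None` and would be registered as a new entity.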