Agenda
• EC-Joint Research Centre (JRC) – Who we are
• Monolingual plagiarism detection (PD) work at the JRC
• Cross-lingual similarity calculation at the JRC
• Named entity (NE) matching across languages
• Linking related news items across languages
• Identifying translations of documents
• JRC's multilingual tools and resources
• Summary

JRC – Who we are
• European Commission (scientific-technical arm of public administration)
• Non-commercial
• Multidisciplinary / multilingual
• Main product: Europe Media Monitor (EMM)
  • ~150,000 online news articles / day in ~50 languages
  • ~3600 sources (worldwide, with focus on Europe)
  • In-depth analysis in 20 languages (NewsExplorer)
  • 24/7, updated every 10 minutes
  • Freely accessible via

Monolingual PD work
• N-gram overlap between pairs of documents
• Karp-Rabin algorithm, using word 5-grams
  • to weed out duplicates in the IAEA document database (ca. 350K documents)
  • to find news article near-duplicates in EMM (applied to all news clusters)
• Method: Search for the longest (in chars) word 6-grams of each document in the EC database and on the web (avoiding strings from the document template)
• If target documents pass the similarity threshold:
  • Full-text comparison of matching documents to detect significant matches
  • Visualise document overlap and check manually.
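The n-gram overlap step on the Monolingual PD slide can be sketched roughly as follows. This is a minimal illustration, not the JRC implementation: the tokenizer, the hash constants `BASE` and `MOD`, the use of Jaccard overlap as the similarity score, and the parameter `k` (number of query strings) are all assumptions made for the example. The slide only states that Karp-Rabin is used with word 5-grams and that the longest word 6-grams serve as search strings.

```python
import re
import zlib

BASE = 1_000_003        # assumed polynomial base for the rolling hash
MOD = (1 << 61) - 1     # assumed large prime modulus


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def ngram_fingerprints(words, n=5):
    """Return the set of Karp-Rabin rolling hashes of all word n-grams."""
    if len(words) < n:
        return set()
    # Map each word to a deterministic integer code.
    codes = [zlib.crc32(w.encode("utf-8")) for w in words]
    high = pow(BASE, n - 1, MOD)  # weight of the word leaving the window
    h = 0
    for c in codes[:n]:
        h = (h * BASE + c) % MOD
    prints = {h}
    for i in range(n, len(codes)):
        h = (h - codes[i - n] * high) % MOD  # drop the oldest word
        h = (h * BASE + codes[i]) % MOD      # append the newest word
        prints.add(h)
    return prints


def ngram_overlap(doc_a, doc_b, n=5):
    """Jaccard overlap of two documents' n-gram fingerprint sets (assumed score)."""
    fa = ngram_fingerprints(tokenize(doc_a), n)
    fb = ngram_fingerprints(tokenize(doc_b), n)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)


def longest_ngrams(words, n=6, k=3):
    """Return the k word n-grams with the most characters, as query strings."""
    grams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
    return sorted(grams, key=lambda g: (-len(g), g))[:k]
```

Document pairs whose `ngram_overlap` exceeds a chosen threshold would then go on to the full-text comparison and manual check described on the slide. The threshold value itself is not given in the source.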
Multilingual NER

Merging name variants
20% + 80%
Condition:
• For all newly found name forms, detect whether they are a variant of an existing NE:
  • Transliteration;
  • Normalisation, using ~30 handwritten rules and removing vowels;
  • Calculate similarity (threshold: 94%).
  • Below threshold → new entity
• For frequent or highly visible names, manually launch a Wikipedia mining process.
  • Check for each variant of a name whether there is a Wikipedia entry.
• New name variants, in all scripts, will be recognised in new EMM …
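The variant-merging steps above (normalise, remove vowels, compare against a 94% threshold) can be sketched as follows. This is a hedged illustration only: the three entries in `RULES` are hypothetical stand-ins for the ~30 handwritten rules, which the slides do not list; diacritic stripping stands in for real transliteration; and `difflib.SequenceMatcher` is an assumed similarity measure, since the source does not say which one is used. Comparing only the normalised forms is a further simplification.

```python
import re
import unicodedata
from difflib import SequenceMatcher

# Hypothetical normalisation rules; NOT the actual JRC rule set.
RULES = [
    (r"ph", "f"),
    (r"(.)\1", r"\1"),  # collapse doubled characters
    (r"w", "v"),
]

THRESHOLD = 0.94  # similarity threshold stated on the slide


def strip_diacritics(name):
    """Remove combining marks (a stand-in for real transliteration)."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalise(name):
    """Apply the rewrite rules, then remove vowels (y is treated as a vowel here)."""
    text = strip_diacritics(name).lower()
    for pattern, repl in RULES:
        text = re.sub(pattern, repl, text)
    text = re.sub(r"[aeiouy]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def similarity(a, b):
    """Assumed similarity measure: difflib's ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def match_entity(new_name, known):
    """Return the id of the best-matching known entity, or None for a new entity.

    `known` maps an entity id to a list of already stored name variants.
    """
    norm_new = normalise(new_name)
    if not norm_new:
        return None  # nothing left to compare after normalisation
    best_id, best_score = None, 0.0
    for entity_id, variants in known.items():
        for variant in variants:
            score = similarity(norm_new, normalise(variant))
            if score > best_score:
                best_id, best_score = entity_id, score
    return best_id if best_score >= THRESHOLD else None
```

For example, `match_entity("Muhammad Ali", {"e1": ["Mohammed Ali"]})` returns `"e1"`, because both names normalise to the same consonant skeleton, while an unrelated name falls below the threshold and is treated as a new entity.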