Measuring the Relatedness between Documents in Comparable Corpora

This paper investigates the use of textual distributional similarity measures in the context of comparable corpora. We address the issue of measuring the relatedness between documents by extracting, measuring and ranking their common content. For this purpose, we designed and applied a methodology that combines available natural language processing technology with statistical methods. Our findings show that a list of common entities and a simple yet robust set of distributional similarity measures are enough to describe and assess the degree of relatedness between the documents. Moreover, our method demonstrated high performance in the task of filtering out documents with a low level of relatedness. By way of example, one of the measures achieved 100%, 100%, 95% and 90% precision when 5%, 10%, 15% and 20% of noise, respectively, was injected.
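The general idea of extracting common content, scoring it and ranking documents can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: "entities" are approximated by lowercase word tokens, cosine similarity over token frequencies stands in for the distributional similarity measures (the abstract does not name them), and the function names `entities`, `common_entities`, `cosine` and `rank_by_relatedness` are hypothetical.

```python
import re
from collections import Counter
from math import sqrt


def entities(text):
    """Frequency profile of a document (assumption: entities = lowercase word tokens)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def common_entities(a, b):
    """Entities that occur in both frequency profiles."""
    return set(a) & set(b)


def cosine(a, b):
    """Cosine similarity between two frequency profiles (illustrative stand-in measure)."""
    dot = sum(a[t] * b[t] for t in common_entities(a, b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_by_relatedness(docs):
    """Return document indices sorted by mean similarity to the other documents, highest first."""
    profiles = [entities(d) for d in docs]
    scores = []
    for i, p in enumerate(profiles):
        others = [cosine(p, q) for j, q in enumerate(profiles) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


docs = [
    "travel insurance policy covers travel",
    "insurance policy for travel abroad",
    "recipe for chocolate cake",
]
ranking = rank_by_relatedness(docs)
print(ranking)
```

Documents at the end of the ranking share the least content with the rest of the collection, so they would be the first candidates for filtering out as noise under this simplified scheme.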
