Automatic analysis of semantic similarity in comparable text through syntactic tree matching

We propose to analyse semantic similarity in comparable text by matching syntactic trees and labeling each alignment with one of five semantic similarity relations. We present a Memory-based Graph Matcher (MBGM) that performs both tasks simultaneously in two stages: it first classifies all candidate node pairs exhaustively with a memory-based learner, then globally optimizes the resulting alignments with a combinatorial optimization algorithm. The method is evaluated on a monolingual treebank of comparable Dutch news texts. Results show that it performs substantially above the baseline and close to the human reference.
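The two-stage idea can be illustrated with a minimal sketch. It is not the MBGM implementation: the features (word overlap, same syntactic category), the stored instances in `MEMORY`, the relation labels, and the function names are all hypothetical, a plain k-nearest-neighbour vote stands in for the memory-based learner, and a brute-force search over one-to-one assignments stands in for the combinatorial optimization algorithm.

```python
from collections import Counter
from itertools import permutations

# Hypothetical memory of labeled instances:
# ((word_overlap, same_category), relation label). Labels are illustrative.
MEMORY = [
    ((1.0, 1.0), "equals"),
    ((0.9, 1.0), "equals"),
    ((0.6, 1.0), "restates"),
    ((0.5, 1.0), "restates"),
    ((0.2, 1.0), "none"),
    ((0.1, 0.0), "none"),
    ((0.0, 0.0), "none"),
]


def features(a, b):
    """Feature vector for a node pair; a node is (category, non-empty word set)."""
    (cat_a, words_a), (cat_b, words_b) = a, b
    overlap = len(words_a & words_b) / len(words_a | words_b)
    return (overlap, 1.0 if cat_a == cat_b else 0.0)


def classify(vec, k=3):
    """k-NN vote over MEMORY; returns (label, fraction of votes)."""
    nearest = sorted(
        MEMORY, key=lambda m: sum((x - y) ** 2 for x, y in zip(m[0], vec))
    )[:k]
    label, votes = Counter(lbl for _, lbl in nearest).most_common(1)[0]
    return label, votes / k


def align(src, tgt):
    """Label every node pair, then pick the best one-to-one alignment."""
    # Stage 1: exhaustive pairwise classification of all (source, target) pairs.
    labels = [[classify(features(s, t)) for t in tgt] for s in src]
    score = [[c if lbl != "none" else 0.0 for lbl, c in row] for row in labels]
    # Stage 2: global optimization by brute force (assumes len(src) <= len(tgt);
    # only feasible for tiny inputs).
    best = max(
        permutations(range(len(tgt)), len(src)),
        key=lambda p: sum(score[i][j] for i, j in enumerate(p)),
    )
    # Keep only pairs that received an actual relation label.
    return [
        (i, j, labels[i][j][0])
        for i, j in enumerate(best)
        if labels[i][j][0] != "none"
    ]


src = [("np", {"the", "minister"}), ("vp", {"resigned", "yesterday"})]
tgt = [("vp", {"resigned"}), ("np", {"the", "minister"})]
print(align(src, tgt))  # [(0, 1, 'equals'), (1, 0, 'restates')]
```

Separating the two stages means the classifier can score each pair in isolation, while the assignment step enforces the global constraint that each node is aligned at most once. For trees of realistic size, a polynomial-time assignment algorithm would replace the brute-force search.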
