Clustering Paraphrases by Word Sense

Automatically generated databases of English paraphrases have a drawback: they return a single list of paraphrases for an input word or phrase. As a result, all senses of a polysemous word are grouped together, unlike WordNet, which partitions different senses into separate synsets. We present a new method for clustering paraphrases by word sense and apply it to the Paraphrase Database (PPDB). We investigate the performance of hierarchical and spectral clustering algorithms and systematically explore different ways of defining the similarity matrix that they take as input. Our method produces sense clusters that are qualitatively and quantitatively good and that substantially improve the PPDB resource.
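
To make the idea of clustering from a similarity matrix concrete, here is a minimal sketch of average-link hierarchical clustering over the paraphrases of a single polysemous word. It is an illustration under stated assumptions, not the method evaluated in the paper: the word list, the pairwise scores, the default score of 0.1, and the stopping threshold of 0.5 are all hypothetical toy values rather than PPDB data, and the helper names `pair_sim` and `average_link_clusters` are invented for this example.

```python
from itertools import combinations

# Hypothetical paraphrases of the polysemous noun "bug" (toy data, not from PPDB).
WORDS = ["insect", "beetle", "glitch", "error", "microbe", "virus"]

# Hypothetical pairwise similarity scores; unlisted pairs fall back to DEFAULT_SIM.
TOY_SIM = {
    frozenset(("insect", "beetle")): 0.9,
    frozenset(("glitch", "error")): 0.8,
    frozenset(("microbe", "virus")): 0.85,
    frozenset(("insect", "microbe")): 0.3,
}
DEFAULT_SIM = 0.1


def pair_sim(x, y):
    """Look up the symmetric similarity between two paraphrases."""
    return TOY_SIM.get(frozenset((x, y)), DEFAULT_SIM)


def average_link_clusters(words, sim, threshold):
    """Greedily merge the two most similar clusters (average link)
    until no pair of clusters has average similarity >= threshold."""
    clusters = [[w] for w in words]

    def avg_sim(a, b):
        return sum(sim(x, y) for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the pair of clusters with the highest average similarity.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: avg_sim(clusters[p[0]], clusters[p[1]]),
        )
        if avg_sim(clusters[i], clusters[j]) < threshold:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

    # Sort for a stable, readable result.
    return sorted(sorted(c) for c in clusters)


sense_clusters = average_link_clusters(WORDS, pair_sim, threshold=0.5)
print(sense_clusters)
# [['beetle', 'insect'], ['error', 'glitch'], ['microbe', 'virus']]
```

With these toy scores, the single flat paraphrase list splits into three groups that roughly correspond to the insect, software-fault, and germ senses. The outcome depends entirely on how the similarity function is defined, which is why the choice of similarity matrix matters for this kind of clustering.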
