Word Sense Clustering and Clusterability

Word sense disambiguation and the related field of automated word sense induction traditionally assume that the occurrences of a lemma can be partitioned into senses. But this seems to be a much easier task for some lemmas than others. Our work builds on recent work that proposes describing word meaning in a graded fashion rather than through a strict partition into senses; in this article we argue that not all lemmas may need the more complex graded analysis, depending on their partitionability. Although there is plenty of evidence from previous studies and from the linguistics literature that there is a spectrum of partitionability of word meanings, this is the first attempt to measure the phenomenon and to couple the machine learning literature on clusterability with word usage data used in computational linguistics.

We propose to operationalize partitionability as clusterability, a measure of how easy the occurrences of a lemma are to cluster. We test two ways of measuring clusterability: (1) existing measures from the machine learning literature that aim to measure the goodness of optimal k-means clusterings, and (2) the idea that if a lemma is more clusterable, two clusterings based on two different “views” of the same data points will be more congruent. The two views that we use are two different sets of manually constructed lexical substitutes for the target lemma, on the one hand monolingual paraphrases, and on the other hand translations. We apply automatic clustering to the manual annotations. We use manual annotations because we want the representations of the instances that we cluster to be as informative and “clean” as possible. We show that when we control for polysemy, our measures of clusterability tend to correlate with partitionability, in particular some of the type-(1) clusterability measures, and that these measures outperform a baseline that relies on the amount of overlap in a soft clustering.
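The two kinds of measurement described above can be illustrated with a small sketch. It is not the measures used in the article: the Rand index stands in here for a congruence score between the paraphrase-based and translation-based clusterings, and a within-cluster to total sum-of-squares ratio stands in for a k-means goodness measure. The function names, the toy labels, and the toy points are all hypothetical, and the cluster labels are taken as given rather than computed by an optimal k-means clustering.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of instance pairs on which two clusterings agree.

    A pair counts as an agreement if both clusterings put the two
    instances together, or both keep them apart.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same instances")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def within_cluster_ratio(points, labels):
    """Within-cluster sum of squares divided by total sum of squares.

    Values near 0 mean tight, well-separated clusters; values near 1
    mean the clustering explains little of the spread in the data.
    """
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    overall = centroid(points)
    total = sum(sq_dist(p, overall) for p in points)
    if total == 0:
        return 0.0
    within = 0.0
    for label in set(labels):
        members = [p for p, l in zip(points, labels) if l == label]
        center = centroid(members)
        within += sum(sq_dist(p, center) for p in members)
    return within / total


# Hypothetical cluster labels for six occurrences of one lemma.
paraphrase_view = [0, 0, 1, 1, 2, 2]
translation_view = [1, 1, 0, 0, 2, 2]   # same partition, renamed labels
unrelated_view = [0, 1, 0, 1, 0, 1]     # cuts across the paraphrase view

congruent = rand_index(paraphrase_view, translation_view)
incongruent = rand_index(paraphrase_view, unrelated_view)

# Hypothetical 2-D instance vectors forming two tight groups.
points = [[0, 0], [0, 1], [10, 10], [10, 11]]
tight = within_cluster_ratio(points, [0, 0, 1, 1])
loose = within_cluster_ratio(points, [0, 1, 0, 1])
```

Under this sketch, a lemma whose two views yield a high agreement score, or whose instances yield a low within-cluster ratio, would count as more clusterable. The Rand index is not corrected for chance, so a chance-adjusted or entropy-based score may be preferable in practice.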

[1]  Adam Kilgarriff,et al.  "I Don’t Believe in Word Senses" , 1997, Comput. Humanit..

[2]  G. Lakoff,et al.  Women, Fire, and Dangerous Things: What Categories Reveal about the Mind , 1988 .

[3]  Christiane Fellbaum,et al.  Book Reviews: WordNet: An Electronic Lexical Database , 1999, CL.

[4]  Deniz Yuret,et al.  KU: Word Sense Disambiguation by Substitution , 2007, Fourth International Workshop on Semantic Evaluations (SemEval-2007).

[5]  Suresh Manandhar,et al.  SemEval-2010 Task 14: Evaluation Setting for Word Sense Induction & Disambiguation Systems , 2009, SEW@NAACL-HLT.

[6]  Diana McCarthy Word Sense Disambiguation: An Overview , 2009, Lang. Linguistics Compass.

[7]  Roberto Navigli,et al.  Clustering and Diversifying Web Search Results with Graph-Based Word Sense Induction , 2013, CL.

[8]  Malik Magdon-Ismail,et al.  Measuring Similarity between Sets of Overlapping Clusters , 2010, 2010 IEEE Second International Conference on Social Computing.

[9]  Julia Hirschberg,et al.  V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure , 2007, EMNLP.

[10]  Eneko Agirre,et al.  Word Sense Disambiguation: Algorithms and Applications (Text, Speech and Language Technology) , 2006 .

[11]  Ted Pedersen,et al.  Unsupervised Corpus-Based Methods for WSD , 2007 .

[12]  Shai Ben-David,et al.  Clusterability: A Theoretical Study , 2009, AISTATS.

[13]  Julio Gonzalo,et al.  The role of named entities in Web People Search , 2009, EMNLP.

[14]  Philip Resnik,et al.  A Perspective on Word Sense Disambiguation Methods and Their Evaluation , 2002 .

[15]  Christiane Fellbaum,et al.  The MASC Word Sense Corpus , 2012, LREC.

[16]  Eneko Agirre,et al.  Computational semantic analysis of language: SemEval-2007 and beyond , 2009, Lang. Resour. Evaluation.

[17]  Helge Dyvik,et al.  Translations as semantic mirrors: from parallel corpus to wordnet , 2004 .

[18]  Marine Carpuat,et al.  Improving Statistical Machine Translation Using Word Sense Disambiguation , 2007, EMNLP.

[19]  Marianna Apidianaki,et al.  Data-Driven Semantic Analysis for Multilingual WSD and Lexical Selection in Translation , 2009, EACL.

[20]  Christiane Fellbaum,et al.  Building Semantic Concordances , 1998 .

[21]  Sabine Schulte im Walde Experiments on the Automatic Induction of German Semantic Verb Classes , 2006, CL.

[22]  Chris Callison-Burch,et al.  Paraphrasing with Bilingual Parallel Corpora , 2005, ACL.

[23]  Chris Biemann,et al.  Crowdsourcing WordNet , 2009 .

[24]  Rada Mihalcea,et al.  The cross-lingual lexical substitution task , 2013, Lang. Resour. Evaluation.

[25]  Mitchell P. Marcus,et al.  OntoNotes: The 90% Solution , 2006, NAACL.

[26]  Roberto Navigli,et al.  SemEval-2007 Task 10: English Lexical Substitution Task , 2007, Fourth International Workshop on Semantic Evaluations (SemEval-2007).

[27]  Nancy Ide,et al.  Word Sense Annotation of Polysemous Words by Multiple Annotators , 2010, LREC.

[28]  Suresh Manandhar,et al.  SemEval-2010 Task 14: Word Sense Induction &Disambiguation , 2010, SemEval@ACL.

[29]  ResnikPhilip,et al.  Distinguishing systems and distinguishing senses: new evaluation methods for Word Sense Disambiguation , 1999 .

[30]  Nancy Ide,et al.  Sense Discrimination with Parallel Corpora , 2002, SENSEVAL.

[31]  Marianna Apidianaki,et al.  Semantic Clustering of Pivot Paraphrases , 2014, LREC.

[32]  Markus Dickinson,et al.  Using semi-experts to derive judgments on word sense alignment: a pilot study , 2012, LREC.

[33]  Marianna Apidianaki Translation-oriented Word Sense Induction Based on Parallel Corpora , 2008, LREC.

[34]  Katrin Erk,et al.  Investigations on Word Senses and Word Usages , 2009, ACL.

[35]  Martha Palmer,et al.  Improving English verb sense disambiguation performance with linguistically motivated features and clear sense distinction boundaries , 2009, Lang. Resour. Evaluation.

[36]  K. Krippendorff Krippendorff, Klaus, Content Analysis: An Introduction to its Methodology . Beverly Hills, CA: Sage, 1980. , 1980 .

[37]  Roberto Navigli,et al.  Word sense disambiguation: A survey , 2009, CSUR.

[38]  G. Lakoff Women, fire, and dangerous things : what categories reveal about the mind , 1989 .

[39]  Katrin Erk,et al.  Measuring Word Meaning in Context , 2013, CL.

[40]  Martha Palmer,et al.  Semantic Tagging for the Penn Treebank , 2000, LREC.

[41]  Klaus Krippendorff,et al.  Content Analysis: An Introduction to Its Methodology , 1980 .

[42]  Mohammed J. Zaki,et al.  Clusterability Detection and Initial Seed Selection in Large Data Sets , 1999 .

[43]  Patrick Hanks,et al.  Do Word Meanings Exist? , 2000, Comput. Humanit..

[44]  Mirella Lapata,et al.  Vector-based Models of Semantic Composition , 2008, ACL.

[45]  Véronique Hoste,et al.  SemEval-2010 Task 3: Cross-Lingual Word Sense Disambiguation , 2010, SemEval@ACL.

[46]  Ted Briscoe,et al.  Semi-productive Polysemy and Sense Extension , 1995, J. Semant..

[47]  Anna Korhonen,et al.  Improving Verb Clustering with Automatically Acquired Selectional Preferences , 2009, EMNLP.

[48]  Yorick Wilks,et al.  Making Sense About Sense , 2007 .

[49]  Jean Véronis,et al.  HyperLex: lexical cartography for information retrieval , 2004, Comput. Speech Lang..

[50]  Rada Mihalcea,et al.  SemEval-2010 Task 2: Cross-Lingual Lexical Substitution , 2009, SemEval@ACL.

[51]  Krister Lindén Word Senses , 2005 .

[52]  Diana McCarthy,et al.  Measuring Similarity of Word Meaning in Context with Lexical Substitutes and Translations , 2011, CICLing.

[53]  David Jurgens,et al.  SemEval-2013 Task 13: Word Sense Induction for Graded and Non-Graded Senses , 2013, SemEval@NAACL-HLT.

[54]  John DeNero,et al.  Unsupervised Translation Sense Clustering , 2012, NAACL.

[55]  David Tuggy Ambiguity, polysemy, and vagueness , 1993 .

[56]  Roberto Navigli,et al.  The English lexical substitution task , 2009, Lang. Resour. Evaluation.