Performance impact of stop lists and morphological decomposition on word–word corpus-based semantic space models

Corpus-based semantic space models, which primarily rely on lexical co-occurrence statistics, have proven effective in modeling and predicting human behavior in a number of experimental paradigms that explore semantic memory representation. The most widely studied extant models, however, are strongly influenced by orthographic word frequency (e.g., Shaoul & Westbury, Behavior Research Methods, 38, 190–195, 2006). This implies that high-frequency closed-class words can bias co-occurrence statistics. Because closed-class words are purported to carry primarily syntactic, rather than semantic, information, the performance of corpus-based semantic space models might be improved by excluding them from co-occurrence statistics (using stop lists), while retaining their syntactic information through other means (e.g., part-of-speech tagging and/or affixes from inflected word forms). Additionally, very little work has explored the effect of applying morphological decomposition to the inflected forms of words in corpora before compiling co-occurrence statistics, despite (controversial) evidence that humans perform early morphological decomposition in semantic processing. In this study, we explored the impact of both factors, stop lists and morphological decomposition, on corpus-based semantic space models. Morphological decomposition appears to significantly improve performance in word–word co-occurrence semantic space models, providing some support for the claim that sublexical information, specifically word morphology, plays a role in lexical semantic processing. In contrast, an overall decrease in performance was observed in models employing stop lists (i.e., excluding closed-class words). Furthermore, we found some evidence that weakens the claim that closed-class words supply primarily syntactic information in word–word co-occurrence semantic space models.
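
To make the two manipulations concrete, the following is a minimal Python sketch of how a stop list and morphological decomposition might be applied before word–word co-occurrence counts are compiled. It is an illustration under stated assumptions, not the procedure used in the study: the stop list, the suffix inventory, the "+suffix" marker tokens, the window size, and all function names are hypothetical, and the naive suffix stripping stands in for a real morphological analyzer.

```python
from collections import Counter

# Hypothetical stop list and suffix inventory, chosen only for illustration.
STOP_WORDS = {"the", "a", "of", "and", "at", "on"}
SUFFIXES = ("ing", "ed", "s")


def decompose(token):
    """Naively split an inflected form into a stem plus a suffix marker.

    Assumption: a real system would use a morphological analyzer; this
    strips the first matching suffix if at least 3 characters remain.
    """
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return [token[:-len(suffix)], "+" + suffix]
    return [token]


def preprocess(tokens, use_stop_list=False, use_decomposition=False):
    """Lowercase tokens, optionally drop stop words and split off suffixes."""
    out = []
    for tok in tokens:
        tok = tok.lower()
        if use_stop_list and tok in STOP_WORDS:
            continue
        out.extend(decompose(tok) if use_decomposition else [tok])
    return out


def cooccurrence_counts(tokens, window=2):
    """Count ordered (target, context) pairs within a symmetric window."""
    counts = Counter()
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(target, tokens[j])] += 1
    return counts
```

For example, `preprocess("The dogs barked at the cats".split(), use_stop_list=True, use_decomposition=True)` removes the closed-class words and splits each inflected form, so that stems and affix markers enter the co-occurrence counts as separate tokens. Comparing counts built with and without each flag mirrors, in miniature, the kind of model comparison the abstract describes.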

[1]  J. Bullinaria Semantic Categorization Using Simple Word Co-occurrence Statistics , 2022 .

[2]  Alec Marantz,et al.  Evidence for Early Morphological Decomposition in Visual Word Recognition , 2010, Journal of Cognitive Neuroscience.

[3]  Douglas L. T. Rohde,et al.  An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence , 2005 .

[4]  Curt Burgess,et al.  From simple associations to the building blocks of language: Modeling meaning in memory with the HAL model , 1998 .

[5]  Christopher D. Manning,et al.  Better Word Representations with Recursive Neural Networks for Morphology , 2013, CoNLL.

[6]  J. Elman An alternative view of the mental lexicon , 2004, Trends in Cognitive Sciences.

[7]  S. Dumais Latent Semantic Analysis. , 2005 .

[8]  Peter D. Turney Similarity of Semantic Relations , 2006, CL.

[9]  Danielle S. McNamara,et al.  Handbook of latent semantic analysis , 2007 .

[10]  Curt Burgess,et al.  Characterizing semantic space: Neighborhood effects in word recognition , 2001, Psychonomic bulletin & review.

[11]  I. Good THE POPULATION FREQUENCIES OF SPECIES AND THE ESTIMATION OF POPULATION PARAMETERS , 1953 .

[12]  Michael Smithson,et al.  A better lemon squeezer? Maximum-likelihood regression with beta-distributed dependent variables. , 2006, Psychological methods.

[13]  W. Bruce Croft,et al.  Query expansion using local and global document analysis , 1996, SIGIR '96.

[14]  J. Bullinaria,et al.  Extracting semantic representations from word co-occurrence statistics: A computational study , 2007, Behavior research methods.

[15]  C. Osgood,et al.  The Measurement of Meaning , 1958 .

[16]  Joan L. Bybee,et al.  Regular morphology and the lexicon. , 1995 .

[17]  Walter Kintsch,et al.  Comprehension: A Paradigm for Cognition , 1998 .

[18]  Cyrus Shaoul,et al.  Word frequency effects in high-dimensional co-occurrence models: A new approach , 2006, Behavior research methods.

[19]  Sally Andrews,et al.  Frequency and neighborhood effects on lexical access: Lexical similarity or orthographic redundancy? , 1992 .

[20]  N. Chater,et al.  Proceedings of the fourteenth annual conference of the cognitive science society , 1992 .

[21]  W. Montague,et al.  Category norms of verbal items in 56 categories A replication and extension of the Connecticut category norms , 1969 .

[22]  Stefan Evert,et al.  Evaluating Neighbor Rank and Distance Measures as Predictors of Semantic Priming , 2013, CMCL.

[23]  John B. Goodenough,et al.  Contextual correlates of synonymy , 1965, CACM.

[24]  Martin F. Porter,et al.  An algorithm for suffix stripping , 1997, Program.

[25]  John R. Anderson,et al.  The Adaptive Character of Thought , 1990 .

[26]  Dušica Filipović Đurđević,et al.  An amorphous model for morphological processing in visual comprehension based on naive discriminative learning. , 2011, Psychological review.

[27]  Kimmo Koskenniemi,et al.  A General Computational Model for Word-Form Recognition and Production , 1984 .

[28]  Don L. Scarborough,et al.  Frequency and Repetition Effects in Lexical Memory. , 1977 .

[29]  J. R. Firth,et al.  A Synopsis of Linguistic Theory, 1930-1955 , 1957 .

[30]  Joseph P. Levy,et al.  Learning Lexical Properties from Word Usage Patterns: Which Context Words Should be Used? , 2000, NCPW.

[31]  R Core Team,et al.  R: A language and environment for statistical computing. , 2014 .

[32]  Rebecca Treiman,et al.  The English Lexicon Project , 2007, Behavior research methods.

[33]  David A. Hull Stemming algorithms: a case study for detailed evaluation , 1996 .

[34]  Charles A. Perfetti,et al.  The limits of co‐occurrence: Tools and theories in language research , 1998 .

[35]  Mark S. Seidenberg,et al.  Explaining derivational morphology as the convergence of codes , 2000, Trends in Cognitive Sciences.

[36]  M. Taft Morphological Decomposition and the Reverse Base Frequency Effect , 2004, The Quarterly journal of experimental psychology. A, Human experimental psychology.

[37]  Mirella Lapata,et al.  Composition in Distributional Models of Semantics , 2010, Cogn. Sci..

[38]  Alec Marantz,et al.  A single route, full decomposition model of morphological complexity: MEG evidence , 2006 .

[39]  A. Zeileis,et al.  Beta Regression in R , 2010 .

[40]  Donna K. Harman,et al.  How effective is suffixing? , 1991, J. Am. Soc. Inf. Sci..

[41]  Evelyne Tzoukermann,et al.  Information retrieval based on context distance and morphology , 1999, SIGIR '99.

[42]  John A Bullinaria,et al.  Extracting semantic representations from word co-occurrence statistics: stop-lists, stemming, and SVD , 2012, Behavior Research Methods.

[43]  Michael N Jones,et al.  Representing word meaning and order information in a composite holographic lexicon. , 2007, Psychological review.

[44]  Cyrus Shaoul,et al.  Exploring lexical co-occurrence space using HiDEx , 2010, Behavior research methods.

[45]  Richard Sproat,et al.  Review of PC-KIMMO: a two-level processor for morphological analysis by Evan L. Antworth. Summer Institute of Linguistics 1990 , 1991 .

[46]  Derek Besner,et al.  Word recognition and identification: Do word-frequency effects reflect lexical access? , 1988 .

[47]  W. Marslen-Wilson,et al.  Abstractness, Allomorphy, and Lexical Architecture , 1999 .

[48]  Cyrus Shaoul,et al.  HiDEx: The High Dimensional Explorer , 2012 .

[49]  Robert Krovetz,et al.  Viewing morphology as an inference process , 1993, Artif. Intell..

[50]  Hinrich Schütze,et al.  Word Space , 1992, NIPS.

[51]  Marco Marelli,et al.  Compositional-ly Derived Representations of Morphologically Complex Words in Distributional Semantics , 2013, ACL.

[52]  C. Osgood The nature and measurement of meaning. , 1952, Psychological bulletin.

[53]  Peter W. Foltz,et al.  An introduction to latent semantic analysis , 1998 .

[54]  R. Baayen,et al.  Reading polymorphemic Dutch compounds: toward a multiple route model of lexical processing. , 2009, Journal of experimental psychology. Human perception and performance.

[55]  W. Bruce Croft,et al.  Query Expansion Using Local and Global Document Analysis , 1996, SIGIR Forum.

[56]  Malti Patel,et al.  Extracting Semantic Representations from Large Text Corpora , 1997, NCPW.

[57]  R. Weale Vision. A Computational Investigation Into the Human Representation and Processing of Visual Information. David Marr , 1983 .

[58]  Lori Buchanan,et al.  WINDSOR: Windsor improved norms of distance and similarity of representations of semantics , 2008, Behavior research methods.

[59]  Kenneth Ward Church One term or two? , 1995, SIGIR '95.

[60]  Fred J. Damerau,et al.  Generating and Evaluating Domain-Oriented Multi-Word Terms from Texts , 1993, Inf. Process. Manag..

[61]  David Poeppel,et al.  Compound words and structure in the lexicon , 2007 .

[62]  A. Caramazza How many levels of processing are there in lexical access , 1997 .

[63]  R. Rapp Word sense discovery based on sense descriptor dissimilarity , 2003, MTSUMMIT.

[64]  J. Grainger Word frequency and neighborhood frequency effects in lexical decision and naming. , 1990 .

[65]  Silvia Bernardini,et al.  The WaCky wide web: a collection of very large linguistically processed web-crawled corpora , 2009, Lang. Resour. Evaluation.

[66]  Stephen Clark,et al.  A Systematic Study of Semantic Vector Space Model Parameters , 2014, CVSC@EACL.

[67]  Roger W. Schvaneveldt,et al.  Pathfinder associative networks: studies in knowledge organization , 1990 .

[68]  Nick Chater,et al.  BOOTSTRAPPING SYNTACTIC CATEGORIES , 1992 .

[69]  S. Ferrari,et al.  Beta Regression for Modelling Rates and Proportions , 2004 .

[70]  Andrew E. Smith,et al.  Evaluation of unsupervised semantic mapping of natural language with Leximancer concept mapping , 2006, Behavior research methods.

[71]  Walter Kintsch,et al.  Predication , 2001, Cogn. Sci..

[72]  C. A. Becker,et al.  Morphological structure and its effect on visual word recognition , 1979 .

[73]  Curt Burgess,et al.  Producing high-dimensional semantic spaces from lexical co-occurrence , 1996 .

[74]  T. Landauer,et al.  A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. , 1997 .