A database of orthography-semantics consistency (OSC) estimates for 15,017 English words

Orthography–semantics consistency (OSC) quantifies the degree of semantic relatedness between a word and its orthographic relatives. It is computed as the frequency-weighted average semantic similarity between the meaning of a given word and the meanings of all words containing that same orthographic string, as captured by distributional semantic models. We present a resource of optimized OSC estimates for 15,017 English words. In a series of analyses, we progressively optimize the OSC variable. We show that the measure performs better in a lexical-processing task when OSC is computed from word-embedding models (in place of traditional count models), when preprocessing of the corpus used to induce semantic vectors is limited (in particular, when part-of-speech tagging and lemmatization are avoided), and when a wider pool of orthographic relatives is used. We further show that OSC is an important and significant predictor of reaction times in visual word recognition and word naming, one that correlates only weakly with other psycholinguistic variables (e.g., family size, word frequency), indicating that it captures a novel source of variance in lexical access. Finally, we discuss some theoretical and methodological implications of adopting OSC as a predictor of reaction times in studies of visual word recognition.
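
The definition of OSC as a frequency-weighted average of semantic similarities can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`cosine`, `osc`), the use of cosine as the similarity measure, the substring test for "containing the same orthographic string", the exclusion of the target word from its own relatives, and the toy vectors and frequencies are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def osc(target, vectors, freqs):
    """Frequency-weighted mean similarity between `target` and its relatives.

    Assumptions (not taken from the source): a relative is any other word in
    `vectors` that contains `target` as a substring, and the target itself
    is left out of the average.
    """
    relatives = [w for w in vectors if target in w and w != target]
    total_freq = sum(freqs[w] for w in relatives)
    if total_freq == 0:
        return None  # no relatives, so OSC is undefined in this sketch
    weighted = sum(
        freqs[w] * cosine(vectors[target], vectors[w]) for w in relatives
    )
    return weighted / total_freq


# Toy, hand-made 2-dimensional "semantic vectors" and frequencies.
toy_vectors = {
    "corn": (1.0, 0.0),
    "cornfield": (1.0, 0.0),  # same direction as "corn": similarity 1
    "corner": (0.0, 1.0),     # orthogonal to "corn": similarity 0
}
toy_freqs = {"corn": 50, "cornfield": 10, "corner": 30}

osc_corn = osc("corn", toy_vectors, toy_freqs)
print(osc_corn)  # (10 * 1 + 30 * 0) / (10 + 30) = 0.25
```

In this toy case the frequent but semantically unrelated relative "corner" pulls the estimate down, which is the behaviour the frequency weighting is meant to capture.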
