Word Association Profiles and their Use for Automated Scoring of Essays

We describe a new representation of the content vocabulary of a text, which we call a word association profile, that captures the proportions of highly associated, mildly associated, unassociated, and dis-associated pairs of words that co-exist in the given text. We illustrate the shape of the distribution and observe how it varies with genre and target audience. We present a study of the relationship between quality of writing and word association profiles. For a set of essays written by college graduates on a number of general topics, we show that the higher-scoring essays tend to have higher percentages of both highly associated and dis-associated pairs, and lower percentages of mildly associated pairs of words. Finally, we use word association profiles to improve a system for automated scoring of essays.
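
To make the idea of a profile concrete, the following is a minimal sketch rather than the authors' implementation. It assumes pointwise mutual information (PMI) as the association measure, which the abstract does not specify, and it uses invented toy counts and hypothetical bin thresholds. The names `pmi`, `association_profile`, `UNIGRAM`, `COOC`, `TOTAL`, and `BIN_EDGES` are illustrative only.

```python
import bisect
import math
from itertools import combinations

# Toy background statistics (invented for illustration, not from the paper).
TOTAL = 1000  # assumed size of the background sample
UNIGRAM = {"bark": 50, "dog": 100, "leash": 20, "tax": 100}
# Co-occurrence counts, keyed by alphabetically sorted word pairs.
COOC = {
    ("bark", "dog"): 40,
    ("bark", "tax"): 5,
    ("dog", "leash"): 10,
    ("dog", "tax"): 1,
}

# Hypothetical thresholds separating four bins, from lowest to highest PMI:
# dis-associated, unassociated, mildly associated, highly associated.
BIN_EDGES = [-1.0, 1.0, 2.5]


def pmi(c_ab, c_a, c_b, total):
    """PMI in bits: log2(p(a, b) / (p(a) * p(b))) from raw counts."""
    return math.log2((c_ab * total) / (c_a * c_b))


def association_profile(words, unigram, cooc, total, bin_edges):
    """Return the proportion of word-type pairs falling in each PMI bin."""
    types = sorted(set(words))
    counts = [0] * (len(bin_edges) + 1)
    n_pairs = 0
    for a, b in combinations(types, 2):
        c_ab = cooc.get((a, b), 0)
        if c_ab == 0:
            # Assumption: pairs with no background data are skipped.
            continue
        value = pmi(c_ab, unigram[a], unigram[b], total)
        counts[bisect.bisect_right(bin_edges, value)] += 1
        n_pairs += 1
    return [c / n_pairs for c in counts] if n_pairs else [0.0] * len(counts)


text = ["dog", "bark", "leash", "tax", "dog"]
profile = association_profile(text, UNIGRAM, COOC, TOTAL, BIN_EDGES)
print(profile)
```

Each entry of `profile` is the share of word pairs in one association band, so profiles of different texts can be compared bin by bin. How unseen pairs are handled and where the bin boundaries fall are design choices left open here.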
