The Tool for the Automatic Analysis of Cohesion 2.0: Integrating semantic similarity and text overlap

This article introduces the second version of the Tool for the Automatic Analysis of Cohesion (TAACO 2.0). Like its predecessor, TAACO 2.0 is a freely available text analysis tool that works on the Windows, Mac, and Linux operating systems; is housed on a user’s hard drive; is easy to use; and allows for batch processing of text files. TAACO 2.0 includes all the original indices reported for TAACO 1.0, but it adds a number of new indices related to local and global cohesion at the semantic level, as reported by latent semantic analysis, latent Dirichlet allocation, and word2vec. The tool also includes a source overlap feature, which calculates lexical and semantic overlap between a source text and a response text (i.e., cohesion between the two texts based on measures of text relatedness). In the first study in this article, we examined the effects that cohesion features, prompt, essay elaboration, and enhanced cohesion had on expert ratings of text coherence, finding that global semantic similarity as reported by word2vec was an important predictor of coherence ratings. A second study was conducted to examine the source and response indices. In this study, we examined whether overlap between the speaking samples found in the TOEFL-iBT integrated speaking tasks (the source) and the responses produced by test-takers was predictive of human ratings of speaking proficiency. The results indicated that the percentage of keywords found in both the source and the response, and the similarity between the source document and the response as reported by word2vec, were significant predictors of speaking quality. Combined, these findings help validate the new indices reported for TAACO 2.0.
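To make the two source-overlap measures named above more concrete, the following is a minimal Python sketch of a keyword-overlap percentage and an embedding-based similarity between a source and a response. It is an illustration under stated assumptions, not TAACO 2.0's implementation: the function names, the definition of "keywords" as the most frequent source words, the use of averaged word vectors, and the toy two-dimensional vectors are all hypothetical. In practice, the vectors would come from a trained word2vec model.

```python
import math
from collections import Counter

PUNCT = ".,;:!?()\"'"


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (w.strip(PUNCT).lower() for w in text.split())
    return [t for t in tokens if t]


def keyword_overlap(source, response, top_n=5):
    """Proportion of source keywords that also occur in the response.

    Assumption: "keywords" are simply the top_n most frequent source words.
    """
    counts = Counter(tokenize(source))
    keywords = [word for word, _ in counts.most_common(top_n)]
    if not keywords:
        return 0.0
    response_words = set(tokenize(response))
    return sum(1 for w in keywords if w in response_words) / len(keywords)


def mean_vector(tokens, vectors):
    """Average the vectors of all tokens that have an entry in `vectors`."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_similarity(source, response, vectors):
    """Cosine similarity between the averaged word vectors of two texts."""
    src = mean_vector(tokenize(source), vectors)
    resp = mean_vector(tokenize(response), vectors)
    if src is None or resp is None:
        return 0.0
    return cosine(src, resp)


if __name__ == "__main__":
    # Toy vectors for illustration only; real vectors come from a trained model.
    toy_vectors = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
    print(keyword_overlap("cat cat dog dog bird", "the cat sat", top_n=2))
    print(semantic_similarity("cat", "dog", toy_vectors))
```

Averaging word vectors is only one simple way to obtain a document-level representation; it is used here because it needs no external libraries.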
