Latent Semantic Analysis Parameters for Essay Evaluation using Small-Scale Corpora

Abstract. Some previous studies (e.g. that carried out by Van Bruggen et al. in 2004) have pointed to a need for additional research to firmly establish the usefulness of LSA (latent semantic analysis) parameters for the automatic evaluation of academic essays. The extreme variability in approaches to this technique makes it difficult to identify the most efficient parameters and their optimum combination. To address this, we conducted a broad-spectrum study of the efficiency of some of the major LSA parameters in small-scale corpora. We used two domain-specific corpora that differed in the structure of the text (one containing only technical terms and the other with more tangential information). Using these corpora, we tested different semantic spaces, formed by applying different parameters and different methods of comparing the texts. The parameters varied were the weighting function (Log-IDF or Log-Entropy), the dimensionality reduction (truncating the matrices after SVD to a set percentage of dimensions), the method of forming pseudo-documents (vector sum or folding-in) and the measure of similarity (cosine or Euclidean distance). We also included two groups of essays to be graded, one written by experts and the other by non-experts. Both groups were evaluated by three human graders and also by LSA. We computed the correlation of each LSA condition with the human graders and conducted an ANOVA to analyse which parameter combination correlates best. The results suggest that Euclidean distances are more efficient than cosines for academic essay evaluation. We found no clear evidence that the classical LSA protocol works systematically better than simpler versions (the classical protocol achieves the best performance only for some combinations of parameters in a few cases), and found that the benefits of reducing dimensionality arise only when the essays are introduced into the semantic space using the folding-in method.
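The parameters listed in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy count matrix, the function names, the use of natural logarithms and the fixed k = 2 (half of the available dimensions) are all assumptions made for illustration. In particular, "vector sum" is assumed here to mean the weighted sum of the term vectors (rows of U_k scaled by the singular values), and folding-in is the standard projection q U_k S_k^{-1}. Only Log-Entropy weighting is shown; Log-IDF would replace the global weight.

```python
import numpy as np


def log_entropy(counts):
    """Return (weighted matrix, global weights) for a term-by-document count matrix."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[1]
    row_sums = counts.sum(axis=1, keepdims=True)
    p = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
    safe_p = np.where(p > 0, p, 1.0)  # log(1) = 0, so empty cells contribute nothing
    global_w = 1.0 + (p * np.log(safe_p)).sum(axis=1) / np.log(n_docs)
    return np.log1p(counts) * global_w[:, None], global_w


def build_space(weighted, k):
    """Truncated SVD: keep the k largest singular triplets."""
    U, s, Vt = np.linalg.svd(weighted, full_matrices=False)
    # term basis, singular values, document coordinates (one row per document)
    return U[:, :k], s[:k], Vt[:k, :].T


def essay_vector(essay_counts, global_w):
    """Weight an essay's raw term counts the same way as the corpus."""
    return np.log1p(np.asarray(essay_counts, dtype=float)) * global_w


def fold_in(q, U_k, s_k):
    """Folding-in: coordinates the essay would have as a corpus document."""
    return (q @ U_k) / s_k


def vector_sum(q, U_k, s_k):
    """Assumed 'vector sum': weighted sum of the term vectors U_k * s_k."""
    return q @ (U_k * s_k)


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def euclidean(a, b):
    return float(np.linalg.norm(a - b))


# Hypothetical toy corpus: 5 terms (rows) x 4 documents (columns).
counts = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 2],
    [1, 0, 0, 1],
])
weighted, global_w = log_entropy(counts)
U_k, s_k, docs_k = build_space(weighted, k=2)

student = essay_vector([1, 1, 0, 0, 1], global_w)
reference = essay_vector([2, 1, 0, 0, 1], global_w)  # same counts as corpus document 0

for name, embed in (("folding-in", fold_in), ("vector sum", vector_sum)):
    a, b = embed(student, U_k, s_k), embed(reference, U_k, s_k)
    print(name, "cosine:", round(cosine(a, b), 3), "distance:", round(euclidean(a, b), 3))
```

The two pseudo-document methods differ only in how each retained dimension is scaled (division versus multiplication by the singular values). Under these assumptions that changes both the cosine and the Euclidean distance between essays, which is one way the choice of method can interact with dimensionality reduction.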

[1] Richard A. Harshman, et al. Indexing by Latent Semantic Analysis, 1990, J. Am. Soc. Inf. Sci.

[2] Peter W. Foltz, et al. Learning from text: Matching readers and texts by latent semantic analysis, 1998.

[3] Danielle S. McNamara, et al. Identifying reading strategies using latent semantic analysis: Comparing semantic benchmarks, 2004, Behavior research methods, instruments, & computers: a journal of the Psychonomic Society, Inc.

[4] Susan T. Dumais, et al. Improving the retrieval of information from external sources, 1991.

[5] José J. Cañas, et al. Assessing short summaries with human judgments procedure and latent semantic analysis in narrative and expository texts, 2006, Behavior research methods.

[6] William M. Pottenger, et al. A Framework for Understanding LSI Performance, 2004.

[7] Arthur C. Graesser, et al. The Right Stuff: Do You Need to Sanitize Your Corpus When Using Latent Semantic Analysis?, 2002.

[8] Preslav Nakov, et al. Weight functions impact on LSA performance, 2001.

[9] Michael P. Stryker, et al. Seeing the whole picture, 1991, Current Biology.

[10] Arthur C. Graesser, et al. Development of Physics Text Corpora for Latent Semantic Analysis, 2001.

[11] Brian D. Davison, et al. Identification of Critical Values in Latent Semantic Indexing, 2005, Foundations of Data Mining and Knowledge Discovery.

[12] Preslav Nakov, et al. Towards Deeper Understanding of the LSA Performance, 2003.

[13] Marian Petre, et al. A Research Taxonomy for Latent Semantic Analysis-Based Educational Applications, 2005.

[14] April Kontostathis, et al. Analysis of the values in the LSI Term-Term Matrix, 2004.

[15] Peter W. Foltz, et al. The Measurement of Textual Coherence with Latent Semantic Analysis, 1998.

[16] Bob Rehder, et al. Using latent semantic analysis to assess knowledge: Some technical considerations, 1998.

[17] Michael McGill, et al. Introduction to Modern Information Retrieval, 1983.

[18] Susan T. Dumais, et al. Enhancing Performance in Latent Semantic Indexing (LSI) Retrieval, 1990.

[19] Colin Tattersall, et al. Latent semantic analysis as a tool for learner positioning in learning networks for lifelong learning, 2004, Br. J. Educ. Technol.

[20] Ricardo Olmos, et al. New algorithms assessing short summaries in expository texts using latent semantic analysis, 2009, Behavior research methods.

[21] A. Graesser, et al. Improving an intelligent tutor's comprehension of students with Latent Semantic Analysis, 1999.

[22] Peter W. Foltz, et al. Reasoning from Multiple Texts: An Automatic Analysis of Readers' Situation Models, 1996.

[23] Stephen Cox, et al. A comparison of some different techniques for vector based call-routing, 2001, INTERSPEECH.

[24] Peter W. Foltz, et al. The intelligent essay assessor: Applications to educational technology, 1999.

[25] Marian Petre, et al. Seeing the Whole Picture: Comparing Computer Assisted Assessment Systems using LSA-based Systems as an Example, 2007.

[26] Walter Kintsch, et al. Predication, 2001, Cogn. Sci.

[27] Susan T. Dumais, et al. Using Linear Algebra for Intelligent Information Retrieval, 1995, SIAM Rev.

[28] Peter M. Wiemer-Hastings, et al. How Latent is Latent Semantic Analysis?, 1999, IJCAI.

[29] Michael B. W. Wolfe, et al. Use of latent semantic analysis for predicting psychological phenomena: Two issues and proposed solutions, 2003, Behavior research methods, instruments, & computers: a journal of the Psychonomic Society, Inc.

[30] T. Landauer, et al. A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge, 1997.

[31] Danielle S. McNamara, et al. Computerizing reading training: Evaluation of a latent semantic analysis space for science text, 2003, Behavior research methods, instruments, & computers: a journal of the Psychonomic Society, Inc.

[32] Preslav Nakov. Latent semantic analysis of textual data, 2000, CompSysTech '00.

[33] William M. Pottenger, et al. A framework for understanding Latent Semantic Indexing (LSI) performance, 2006, Inf. Process. Manag.

[34] Gustaf Neumann, et al. Parameters driving effectiveness of automated essay scoring with LSA, 2005.