A Supervised Clustering Method for Text Classification

This paper describes a supervised three-tier clustering method for classifying students' qualitative physics essays in the Why2-Atlas tutoring system. The main purpose of text categorization in our tutoring system is to map the statements in a student's essay onto physics principles and misconceptions. A simple 'bag-of-words' representation with a naive Bayes classifier proved unsatisfactory for this analysis: it produced many misclassifications, both because the concepts themselves are closely related and because it could not handle misconceptions. We therefore investigate how well the k-nearest neighbor algorithm, coupled with clusters of physics concepts, classifies students' essays. We tag each document with a three-tier scheme (cluster, sub-cluster, and class) and find that this kind of supervised hierarchical clustering leads to a better understanding of the student's essay.
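As a rough illustration of the idea, the sketch below runs a k-nearest neighbor vote tier by tier over bag-of-words vectors, narrowing the candidate examples to the winning cluster before voting on the sub-cluster, and then on the class. It is a minimal sketch under stated assumptions, not the system described in the paper: the cosine similarity measure, the top-down narrowing, the function names, and the tiny training set with its labels are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector: lowercase token counts (whitespace tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_vote(query, examples, tier, k):
    """Majority label at one tier among the k most similar examples."""
    ranked = sorted(examples, key=lambda ex: cosine(query, ex[0]), reverse=True)
    votes = Counter(labels[tier] for _, labels in ranked[:k])
    return votes.most_common(1)[0][0]


def classify(text, training, k=3):
    """Return a (cluster, sub-cluster, class) tuple for one statement.

    Assumption: each tier's vote is restricted to the examples that
    carry the label chosen at the tier above.
    """
    query = bow(text)
    candidates = [(bow(t), labels) for t, labels in training]
    path = []
    for tier in range(3):
        label = knn_vote(query, candidates, tier, k)
        path.append(label)
        candidates = [ex for ex in candidates if ex[1][tier] == label]
    return tuple(path)


# Hypothetical training statements tagged (cluster, sub-cluster, class).
TRAINING = [
    ("the force of gravity pulls the ball down",
     ("force", "gravity", "gravity-acts-downward")),
    ("gravity is the only force on the ball",
     ("force", "gravity", "gravity-only-force")),
    ("heavier objects fall faster than lighter ones",
     ("force", "gravity", "misconception-heavier-falls-faster")),
    ("the ball keeps constant horizontal velocity",
     ("motion", "velocity", "constant-horizontal-velocity")),
    ("velocity of the ball stays the same horizontally",
     ("motion", "velocity", "constant-horizontal-velocity")),
    ("acceleration is zero so velocity does not change",
     ("motion", "acceleration", "zero-acceleration")),
]

if __name__ == "__main__":
    print(classify("horizontal velocity of the ball", TRAINING, k=3))
```

Treating a misconception as an ordinary class label inside a cluster, as in the third training example, is one possible way to let the same vote cover both principles and misconceptions; the paper itself may organize them differently.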
