Experiments with Sentence Classification

We present a set of experiments on sentence classification that address issues of representation and feature selection, and we compare our findings with results from work on the more general text classification task. The domain of our investigation is an email-based help-desk corpus. We evaluate several popular classification algorithms in combination with several popular feature selection methods. The results highlight similarities between sentence and text classification, such as the superiority of Support Vector Machines, as well as differences: feature selection is less useful for sentence classification, and common preprocessing techniques (stop-word removal and lemmatization) have a detrimental effect.
