Analysis of Stemming Alternatives and Dependency Pattern Support in Text Classification

In this paper, we study text classification algorithms by utilizing two concepts from the Information Extraction discipline: dependency patterns and stemmer analysis. To the best of our knowledge, this is the first study to fully explore all possible dependency patterns during the formation of the solution vector in the text categorization problem. The proposed method of pattern utilization improves upon the benchmark of the classical text classification approach. The test results show that four patterns achieve the highest ranks, namely the participial modifier, adverbial clause modifier, conjunct, and possession modifier. For the stemming process, we benefit from both morphological and syntactic stemming tools, the Porter stemmer and the Stanford stemmer, respectively. One of the main contributions of this paper is its approach to stemmer utilization: stemming is performed not only on individual words but also on all extracted pattern pairs in the texts. The Porter stemmer is observed to be optimal for stemming words, while the raw, unstemmed form slightly outperforms the other approaches for pattern stemming. For the evaluation of our algorithm, two standard datasets, Reuters-21578 and the National Science Foundation (NSF) abstracts, are used.
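
The following is a minimal illustrative sketch, not the authors' implementation: it shows how dependency-pattern features and stemmed word features could be combined into a single bag-of-features representation and fed to a TF-IDF weighted linear SVM. It assumes spaCy as a stand-in for the Stanford typed-dependency parser, NLTK's Porter stemmer for stemming, and approximates the four highlighted relations with spaCy's label names (`acl`, `advcl`, `conj`, `poss`); the feature format `rel(head,dependent)` and the helper `extract_features` are hypothetical choices made for this example.

```python
# Sketch only: dependency-pattern + stemmed-word features for text classification.
# Assumptions: spaCy stands in for the Stanford parser; relation names are spaCy's,
# not the exact typed dependencies used in the paper.
import spacy
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

nlp = spacy.load("en_core_web_sm")   # requires: python -m spacy download en_core_web_sm
stemmer = PorterStemmer()

# Rough spaCy counterparts of the four best-performing patterns
# (participial modifier, adverbial clause modifier, conjunct, possession modifier).
TARGET_RELATIONS = {"acl", "advcl", "conj", "poss"}

def extract_features(text, stem_words=True, stem_patterns=False):
    """Return a token list mixing (optionally stemmed) words and dependency pairs."""
    doc = nlp(text)
    features = []
    for token in doc:
        word = token.text.lower()
        features.append(stemmer.stem(word) if stem_words else word)
        # Keep head-dependent pairs for the selected dependency relations.
        if token.dep_ in TARGET_RELATIONS:
            head, dep = token.head.text.lower(), word
            if stem_patterns:
                head, dep = stemmer.stem(head), stemmer.stem(dep)
            features.append(f"{token.dep_}({head},{dep})")
    return features

# TF-IDF over the precomputed feature tokens, classified with a linear SVM,
# mirroring the classical vector-space setup the paper builds on.
model = make_pipeline(
    TfidfVectorizer(analyzer=lambda feats: feats),  # identity analyzer: features are already tokenized
    LinearSVC(),
)

docs = ["The committee's report, written hastily, was rejected.",
        "Crude oil prices rose while gold futures fell."]
labels = ["politics", "markets"]
model.fit([extract_features(d) for d in docs], labels)
print(model.predict([extract_features("Oil prices and gold moved sharply.")]))
```

Toggling `stem_words` and `stem_patterns` reproduces, in spirit, the comparison described above: stemming applied to words only, to pattern pairs only, to both, or to neither.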
