Text classification with the support of pruned dependency patterns

We propose a novel text classification approach based on two main concepts: lexical dependency and pruning. We extend the standard bag-of-words method by including dependency patterns in the feature vector. We experiment with 37 lexical dependencies and analyze the effect of each dependency type separately in order to identify the most discriminative ones. We analyze the effect of pruning (filtering out low-frequency features) on both word features and dependency features. Parameter tuning is performed with eight different pruning levels to determine the optimal levels. The experiments were repeated on three datasets with different characteristics. We observed a significant improvement in success rates as well as a reduction in the dimensionality of the feature vector. We argue that, in contrast to previous work in the literature, a much higher pruning level should be used in text classification. By analyzing the results from the dataset perspective, we also show that datasets with similar formality levels have similar leading dependencies and behave similarly as the pruning level varies.
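To make the feature construction concrete, the following is a minimal sketch of a bag-of-words vector extended with dependency patterns and pruned by frequency. It is an illustration under stated assumptions, not the implementation used in the experiments. The dependency triples are assumed to come from an external parser and are supplied by hand here. The function names, the `rel(head,dependent)` pattern format, the use of total corpus counts as the pruning criterion, and the threshold values are all hypothetical choices made for the example.

```python
from collections import Counter

# Assumption: each document is a pair (tokens, dependencies), where
# dependencies are (relation, head, dependent) triples produced by some
# external dependency parser. No parser is invoked in this sketch.

def extract_features(tokens, dependencies):
    """Return word features and dependency-pattern features for one document."""
    words = [("w", t.lower()) for t in tokens]
    deps = [("d", f"{rel}({head.lower()},{dep.lower()})")
            for rel, head, dep in dependencies]
    return words, deps

def build_vocabulary(docs, word_min_count, dep_min_count):
    """Keep features whose total corpus count reaches the pruning level.

    Words and dependency patterns get separate pruning levels
    (hypothetical parameter names).
    """
    counts = Counter()
    for tokens, dependencies in docs:
        words, deps = extract_features(tokens, dependencies)
        counts.update(words)
        counts.update(deps)
    thresholds = {"w": word_min_count, "d": dep_min_count}
    kept = sorted(f for f, c in counts.items() if c >= thresholds[f[0]])
    return {feature: index for index, feature in enumerate(kept)}

def vectorize(doc, vocab):
    """Map one document to a raw count vector over the pruned vocabulary."""
    tokens, dependencies = doc
    words, deps = extract_features(tokens, dependencies)
    vec = [0] * len(vocab)
    for feature in words + deps:
        if feature in vocab:
            vec[vocab[feature]] += 1
    return vec

# Toy corpus with hand-written dependency triples (illustrative only).
docs = [
    (["prices", "rose", "sharply"],
     [("nsubj", "rose", "prices"), ("advmod", "rose", "sharply")]),
    (["prices", "fell"], [("nsubj", "fell", "prices")]),
    (["prices", "rose"], [("nsubj", "rose", "prices")]),
]

vocab = build_vocabulary(docs, word_min_count=2, dep_min_count=2)
vectors = [vectorize(doc, vocab) for doc in docs]
```

With both pruning levels set to 2, the toy vocabulary shrinks from seven candidate features to three, which illustrates the dimensionality reduction that pruning provides. A real setup would likely apply term weighting and feed the vectors to a classifier, and would tune the two pruning levels separately.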
