A Pattern Based Two-Stage Text Classifier

In a classification problem typically we face two challenging issues, the diverse characteristic of negative documents and sometimes a lot of negative documents that are closed to positive documents. Therefore, it is hard for a single classifier to clearly classify incoming documents into classes. This paper proposes a novel gradual problem solving to create a two-stage classifier. The first stage identifies reliable negatives (negative documents with weak positive characteristics). It concentrates on minimizing the number of false negative documents (recall-oriented). We use Rocchio, an existing recall based classifier, for this stage. The second stage is a precision-oriented "fine tuning", concentrates on minimizing the number of false positive documents by applying pattern (a statistical phrase) mining techniques. In this stage a pattern-based scoring is followed by threshold setting (thresholding). Experiment shows that our statistical phrase based two-stage classifier is promising.

[1]  Yiming Yang,et al.  A study of thresholding strategies for text categorization , 2001, SIGIR '01.

[2]  Yiming Yang,et al.  Multilabel classification with meta-level features , 2010, SIGIR.

[3]  Yiming Yang,et al.  An Evaluation of Statistical Approaches to Text Categorization , 1999, Information Retrieval.

[4]  W. Bruce Croft,et al.  Search Engines - Information Retrieval in Practice , 2009 .

[5]  Céline Rouveirol,et al.  Machine Learning: ECML-98 , 1998, Lecture Notes in Computer Science.

[6]  Ellen M. Voorhees,et al.  Evaluating Evaluation Measure Stability , 2000, SIGIR 2000.

[7]  Hinrich Schütze,et al.  Introduction to information retrieval , 2008 .

[8]  Stephen E. Robertson,et al.  Building a filtering test collection for TREC 2002 , 2003, SIGIR.

[9]  Yuefeng Li,et al.  Mining ontology for automatically acquiring Web user information needs , 2006, IEEE Transactions on Knowledge and Data Engineering.

[10]  Hui Zhao,et al.  Text Classification Improved through Automatically Extracted Sequences , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[11]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[12]  J. J. Rocchio,et al.  Relevance feedback in information retrieval , 1971 .

[13]  Yoram Singer,et al.  Boosting and Rocchio applied to text filtering , 1998, SIGIR '98.

[14]  David D. Lewis,et al.  An evaluation of phrasal and clustered representations on a text categorization task , 1992, SIGIR '92.

[15]  Hinrich Schütze,et al.  A comparison of classifiers and document representations for the routing problem , 1995, SIGIR '95.

[16]  Yue Xu,et al.  Deploying Approaches for Pattern Refinement in Text Mining , 2006, Sixth International Conference on Data Mining (ICDM'06).

[17]  Raymond Y. K. Lau,et al.  A two-stage decision model for information filtering , 2012, Decis. Support Syst..

[18]  Yuefeng Li,et al.  Mining positive and negative patterns for relevance feature discovery , 2010, KDD.

[19]  Gang Zhou,et al.  An Extensive Empirical Study of Feature Selection for Text Categorization , 2008, Seventh IEEE/ACIS International Conference on Computer and Information Science (icis 2008).

[20]  Berkant Barla Cambazoglu,et al.  Review of "Search Engines: Information Retrieval in Practice" by Croft, Metzler and Strohman , 2010, Inf. Process. Manag..

[21]  Fabrizio Sebastiani,et al.  Machine learning in automated text categorization , 2001, CSUR.

[22]  Yi Zhang,et al.  Maximum likelihood estimation for filtering thresholds , 2001, SIGIR '01.

[23]  Raymond Y. K. Lau,et al.  A two-stage text mining model for information filtering , 2008, CIKM '08.