Early text classification: a Naïve solution

Text classification is a widely studied problem, and it can be considered solved for some domains and under certain circumstances. There are scenarios, however, that have received little or no attention despite their relevance and applicability. One such scenario is early text classification, where one needs to determine the category of a document using partial information only. A document is processed as a sequence of terms, and the goal is to devise a method that can make predictions as early as possible, that is, after reading as few terms as possible. The importance of this variant of the text classification problem is evident in domains like sexual predator detection, where one wants to identify an offender as early as possible. This paper analyzes the suitability of the standard naive Bayes classifier for approaching this problem. Specifically, we assess its performance when classifying documents after seeing an increasing number of terms. A simple modification to the standard naive Bayes implementation allows us to make predictions with partial information. To the best of our knowledge, naive Bayes has not been used for this purpose before. Through an extensive experimental evaluation we show the effectiveness of the classifier for early text classification. Moreover, we show that this simple solution is very competitive when compared with more elaborate state-of-the-art methodologies. We foresee that our work will pave the way for the development of more effective early text classification techniques based on the naive Bayes formulation.
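To make the idea of predicting from partial information concrete, the following is a minimal sketch, not the implementation evaluated in the paper. It assumes a multinomial naive Bayes model with Laplace smoothing, and it scores a document prefix by summing log-likelihoods over only the terms read so far. The abstract does not specify the paper's actual modification, so this choice is an assumption; the function names (`train_multinomial_nb`, `predict_prefix`, `early_predictions`) and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods.

    docs is a list of token lists; labels is the parallel list of classes.
    """
    vocab = {term for doc in docs for term in doc}
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        total = sum(term_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {
            t: math.log((term_counts[c][t] + alpha) / total) for t in vocab
        }
    return log_prior, log_like


def predict_prefix(terms, log_prior, log_like):
    """Classify using only the terms seen so far (assumed partial-information rule)."""
    scores = dict(log_prior)
    for term in terms:
        for c in scores:
            # Terms outside the training vocabulary are skipped.
            if term in log_like[c]:
                scores[c] += log_like[c][term]
    return max(scores, key=scores.get)


def early_predictions(doc, log_prior, log_like):
    """Return the predicted class after reading 1, 2, ..., len(doc) terms."""
    return [
        predict_prefix(doc[:k], log_prior, log_like)
        for k in range(1, len(doc) + 1)
    ]


# Hypothetical toy corpus, for illustration only.
train_docs = [
    ["free", "prize", "click"],
    ["meeting", "agenda", "notes"],
    ["free", "click", "now"],
    ["project", "meeting", "notes"],
]
train_labels = ["spam", "ham", "spam", "ham"]

log_prior, log_like = train_multinomial_nb(train_docs, train_labels)
print(early_predictions(["free", "click", "meeting"], log_prior, log_like))
```

Each entry of the returned list is the prediction after one more term has been read, so accuracy can be plotted against the number of terms seen. Deciding when a prediction is reliable enough to stop reading is a separate question that this sketch does not address.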
