Lossless Pruned Naive Bayes for Big Data Classifications

Abstract In the fast-growing big data era, the volume and variety of data processed by Internet applications are increasing drastically. Real-world search engines commonly use text classifiers with thousands of classes to improve relevance or data quality. These large-scale classification problems pose severe runtime performance challenges, so practitioners often resort to fast approximation techniques. The increase in classification speed comes at a cost, however: the approximations are lossy and mis-assign classes relative to the original, reference classification algorithm. To address this problem, we introduce a Lossless Pruned Naive Bayes (LPNB) classification algorithm tailored to real-world big data applications with thousands of classes. LPNB achieves significant speed-ups by drawing on Information Retrieval (IR) techniques for efficient posting list traversal and pruning. We show empirically that LPNB classifies text up to eleven times faster than standard Naive Bayes on a real-world data set with 7205 classes, and we extrapolate larger gains for larger taxonomies. This acceleration is significant in practice, as it greatly reduces the required computation time. Moreover, the method is lossless: its output is identical to that of standard Naive Bayes, in contrast to existing techniques such as hierarchical classification. The acceleration does not depend on the taxonomy structure and applies to both hierarchical and flat taxonomies.
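
To make the idea concrete, below is a minimal Python sketch of the lossless pruning principle the abstract describes, assuming multinomial Naive Bayes with Laplace smoothing. It is an illustration in the spirit of MaxScore-style pruned query evaluation from the IR literature, not the paper's actual implementation; the function names (train_nb, classify_full, classify_pruned) and data layout are hypothetical. After each document term is scored, any class whose optimistic bound (accumulated score plus the best possible remaining contribution) falls below another class's pessimistic bound can no longer win the argmax and is safely discarded.

import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    # Fit multinomial Naive Bayes with Laplace smoothing. Returns log
    # priors, per-class log-probabilities for terms seen in that class,
    # and a per-class default log-probability for unseen terms.
    vocab = {t for d in docs for t in d}
    class_docs, class_terms = defaultdict(int), defaultdict(Counter)
    for d, c in zip(docs, labels):
        class_docs[c] += 1
        class_terms[c].update(d)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_prob, default = {}, {}
    for c, counts in class_terms.items():
        denom = sum(counts.values()) + alpha * len(vocab)
        log_prob[c] = {t: math.log((f + alpha) / denom) for t, f in counts.items()}
        default[c] = math.log(alpha / denom)
    return log_prior, log_prob, default

def term_logp(log_prob, default, c, t):
    return log_prob[c].get(t, default[c])

def classify_full(model, doc):
    # Reference classifier: exhaustively score every class.
    log_prior, log_prob, default = model
    tf = Counter(doc)
    return max(log_prior, key=lambda c: log_prior[c] +
               sum(f * term_logp(log_prob, default, c, t) for t, f in tf.items()))

def classify_pruned(model, doc):
    # Lossless pruned scoring: same argmax as classify_full (up to ties).
    log_prior, log_prob, default = model
    classes, tf = list(log_prior), Counter(doc)
    terms = list(tf)
    # Per-term upper/lower bounds on log P(t|c) over all classes.
    ub = {t: max(term_logp(log_prob, default, c, t) for c in classes) for t in terms}
    lb = {t: min(term_logp(log_prob, default, c, t) for c in classes) for t in terms}
    # Score high-impact terms first so pruning bites early.
    terms.sort(key=lambda t: tf[t] * (ub[t] - lb[t]), reverse=True)
    # Suffix sums: best / worst possible contribution of the remaining terms.
    suf_ub = [0.0] * (len(terms) + 1)
    suf_lb = [0.0] * (len(terms) + 1)
    for i in range(len(terms) - 1, -1, -1):
        suf_ub[i] = suf_ub[i + 1] + tf[terms[i]] * ub[terms[i]]
        suf_lb[i] = suf_lb[i + 1] + tf[terms[i]] * lb[terms[i]]
    acc, cand = dict(log_prior), set(classes)
    for i, t in enumerate(terms):
        for c in cand:
            acc[c] += tf[t] * term_logp(log_prob, default, c, t)
        # A class whose optimistic bound falls below the best pessimistic
        # bound cannot end up with the highest score: prune it.
        threshold = max(acc[c] + suf_lb[i + 1] for c in cand)
        cand = {c for c in cand if acc[c] + suf_ub[i + 1] >= threshold}
    return max(cand, key=acc.get)

A small sanity check of the losslessness claim on toy data:

docs = [["cheap", "watch"], ["buy", "watch", "now"], ["meeting", "agenda"]]
labels = ["spam", "spam", "work"]
model = train_nb(docs, labels)
query = ["cheap", "watch", "now"]
assert classify_pruned(model, query) == classify_full(model, query)  # both "spam"

Unlike this sketch, which touches every surviving class for each term, a posting-list formulation as described in the abstract would traverse inverted lists so that pruned classes incur no further work at all, which is where the large speed-ups on thousands of classes come from.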
