Comparison of Naive Bayes, Random Forest, Decision Tree, Support Vector Machines, and Logistic Regression Classifiers for Text Reviews Classification

Today, highly scalable computing environments make it possible to carry out a variety of data-intensive natural language processing and machine-learning tasks. One of these is text classification, several aspects of which have recently been investigated by many data scientists. This paper investigates the Naïve Bayes, Random Forest, Decision Tree, Support Vector Machines, and Logistic Regression classifiers as implemented in Apache Spark, an in-memory computing platform. The focus is on comparing these classifiers by evaluating their classification accuracy in relation to the size of the training data sets and the number of n-grams. The experiments analyzed short texts taken from Amazon product-review data.
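The two quantities the comparison relies on, word n-gram features and classification accuracy, can be illustrated with a minimal plain-Python sketch. This is not the Spark implementation used in the paper, which is not shown here. The function names (`word_ngrams`, `ngram_counts`, `accuracy`) and the whitespace tokenization are assumptions made only for illustration.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of a text as a list of tuples.

    Assumption: tokens are obtained by lowercasing and splitting on whitespace.
    """
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, max_n):
    """Count all word n-grams of length 1..max_n in a text (bag-of-n-grams)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(word_ngrams(text, n))
    return counts


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty label list")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

In such a setup, each review would be mapped to its n-gram counts before training, and `accuracy` would be computed on held-out reviews for each classifier, training-set size, and n-gram setting.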
