Parallel naive Bayes algorithm for large-scale Chinese text classification based on Spark

The rapid growth of Chinese text data on the Internet has significantly lengthened the time needed to classify it. To address this problem, this paper proposes and implements a parallel naive Bayes algorithm (PNBA) for Chinese text classification on Spark, an in-memory parallel computing platform for big data. The algorithm parallelizes the entire training and prediction process of the naive Bayes classifier, mainly by using the resilient distributed dataset (RDD) programming model. For comparison, a PNBA based on Hadoop MapReduce is also implemented. Experimental results show that, in the same computing environment and on Chinese text sets of the same size, the Spark PNBA clearly outperforms the Hadoop PNBA on key indicators such as speedup ratio and scalability. Spark-based parallel algorithms can therefore better meet the requirements of large-scale Chinese text data mining.
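The abstract does not show the implementation, so the following is only a minimal sketch of why naive Bayes training parallelizes well: each data partition can be counted independently (a map step) and the partial counts merged by addition (a reduce step), which is the general shape an RDD-based job would take. It is plain Python rather than Spark code, and it is not the paper's algorithm. The function names, the toy documents, the assumption that texts are already segmented into words, and the use of Laplace smoothing are all illustrative assumptions.

```python
from collections import Counter
from functools import reduce
import math


def count_partition(docs):
    """Map step: count class and (class, word) occurrences in one partition."""
    class_counts, word_counts = Counter(), Counter()
    for label, tokens in docs:
        class_counts[label] += 1
        for tok in tokens:
            word_counts[(label, tok)] += 1
    return class_counts, word_counts


def merge(a, b):
    """Reduce step: add the partial counts from two partitions."""
    return a[0] + b[0], a[1] + b[1]


def train(partitions):
    """Combine per-partition counts into one model (hypothetical layout)."""
    class_counts, word_counts = reduce(merge, map(count_partition, partitions))
    vocab = {word for (_, word) in word_counts}
    totals = Counter()  # total number of word occurrences per class
    for (label, _), n in word_counts.items():
        totals[label] += n
    return class_counts, word_counts, totals, vocab


def predict(model, tokens):
    """Return the class with the highest log-posterior for one document."""
    class_counts, word_counts, totals, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / n_docs)  # log prior
        for tok in tokens:
            # Laplace (add-one) smoothing; an assumption of this sketch.
            numerator = word_counts[(label, tok)] + 1
            denominator = totals[label] + len(vocab)
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, pre-segmented documents split across two "partitions" (made-up data).
partitions = [
    [("sports", ["足球", "比赛"]), ("finance", ["股票", "市场"])],
    [("sports", ["比赛", "冠军"]), ("finance", ["市场", "利率"])],
]
model = train(partitions)
```

Because the merge step is plain addition, it is associative and commutative, so partitions can be counted in any order and on any number of workers. Prediction is also independent per document, which is why both phases can be parallelized.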
