A Parallel Conditional Random Fields Model Based on Spark Computing Environment

As one of the well-known probabilistic graphical models in machine learning, conditional random fields (CRFs) can merge different types of features, encode known relationships between observations, and construct consistent interpretations; they have been widely applied in many areas of Natural Language Processing (NLP). With the rapid development of the internet and information systems, performance issues inevitably arise when traditional CRFs deal with massive data. This paper proposes SCRFs, a parallel optimization of CRFs based on Resilient Distributed Datasets (RDDs) in the Spark computing framework. SCRFs optimizes traditional CRFs in the following stages. First, all features are generated in parallel, and the intermediate data that are used frequently are cached in memory to speed up iteration. By removing low-frequency features from the model, SCRFs also guards against overfitting and improves prediction quality. Second, some specific features are dynamically added in parallel to correct the model during training. For efficient prediction, a max-sum algorithm is proposed that infers the most likely state sequence by extending the belief propagation algorithm. Finally, we implement SCRFs on Spark 1.6.0 and evaluate its performance on two widely used benchmarks: Named Entity Recognition and Chinese Word Segmentation. Compared with traditional CRFs models running on the Hadoop and Spark platforms respectively, the experimental results show that SCRFs has clear advantages in model accuracy and iteration performance.
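To make the prediction step concrete, the following is a minimal sketch of max-sum decoding on a linear-chain model, written in plain Python rather than on Spark. It is not the SCRFs implementation: the function name `max_sum_decode` and the inputs `emission` and `transition` (tables of log-potentials) are assumptions introduced only for illustration, and the abstract does not specify how SCRFs extends belief propagation.

```python
from typing import List, Sequence


def max_sum_decode(emission: Sequence[Sequence[float]],
                   transition: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring state sequence for a linear chain.

    emission[t][s]   : log-potential of state s at position t (hypothetical input)
    transition[p][s] : log-potential of moving from state p to state s (hypothetical input)
    """
    if not emission:
        return []
    n_states = len(emission[0])

    # Forward pass: best[s] holds the max score of any path ending in state s.
    best = list(emission[0])
    backpointers = []
    for t in range(1, len(emission)):
        new_best = []
        pointers = []
        for s in range(n_states):
            # Pick the predecessor that maximizes the accumulated score.
            prev = max(range(n_states), key=lambda p: best[p] + transition[p][s])
            new_best.append(best[prev] + transition[prev][s] + emission[t][s])
            pointers.append(prev)
        backpointers.append(pointers)
        best = new_best

    # Backward pass: follow the stored pointers from the best final state.
    state = max(range(n_states), key=lambda s: best[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Usage: two states, three positions, no transition preference.
emission = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
flat = [[0.0, 0.0], [0.0, 0.0]]
print(max_sum_decode(emission, flat))
```

Working in log space turns products of potentials into sums, which is why the procedure is called max-sum; on a chain it coincides with Viterbi decoding.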
