ParaDist-HMM: A Parallel Distributed Implementation of Hidden Markov Model for Big Data Analytics using Spark

Big data refers to massive volumes of heterogeneous, multisource data that often require fast processing and real-time analysis. Solving big data analytics problems requires powerful platforms to handle this volume of data and efficient machine learning algorithms to exploit its full potential. Hidden Markov models (HMMs) are rich statistical models widely used in various fields, especially for modeling and analyzing time-varying data sequences. They owe their success to the existence of many efficient and reliable algorithms. In this paper, we present ParaDist-HMM, a parallel distributed implementation of the hidden Markov model for modeling and solving big data analytics problems. We describe the development and implementation of the improved algorithms, and we propose a Spark-based approach, consisting of a parallel distributed big data architecture in a cloud computing environment, to put the proposed algorithms into practice. We evaluated the model on synthetic and real financial data in terms of running time, speedup, and prediction quality, the latter measured by accuracy and root mean square error. Experimental results demonstrate that the ParaDist-HMM algorithms outperform other implementations of hidden Markov models in processing speed and accuracy, and therefore in efficiency and effectiveness.

Keywords: Big data; machine learning; hidden Markov model; forward; backward; Baum-Welch; parallel distributed computing; Spark; cloud computing; ParaDist-HMM
