Adaptation of Classical Machine Learning Algorithms to Big Data Context: Problems and Challenges: Case Study: Hidden Markov Models Under Spark

Big Data Analytics presents a great opportunity for scientists and businesses, and it has changed how huge amounts of data are managed and analyzed. To extract value from big data, we often use Machine Learning algorithms, which have in the past demonstrated their processing speed, efficiency, and accuracy. Today, however, the complex characteristics of big data raise new problems and challenges in designing and developing Machine Learning algorithms for Big Data Analytics. It is therefore essential to revisit the classical algorithms and adapt them to this new context. One method of adaptation is to couple Machine Learning algorithms with new technologies (e.g., distributed computing on GPUs, Hadoop, Spark) to reduce the computational cost of data analysis. This paper highlights the main challenges of adapting Machine Learning algorithms to the Big Data context and describes a novel method for making these algorithms efficient and fast in Big Data processing, taking Hidden Markov Models on the Spark framework as a case study. A comparison of the complexity of the classical algorithms with that of their Spark-based adaptations shows a great improvement.
