Big data machine learning using Apache Spark MLlib

Artificial intelligence, and machine learning in particular, has been used in many ways by the research community to turn diverse and even heterogeneous data sources into high-quality facts and knowledge, providing premier capabilities for accurate pattern discovery. However, applying machine learning strategies to big and complex datasets is computationally expensive and consumes a very large amount of logical and physical resources, such as data file space, CPU, and memory. A sophisticated platform for efficient big data analytics is becoming more important as the amount of data generated on a daily basis exceeds a quintillion bytes. Apache Spark MLlib is one of the most prominent platforms for big data analysis, offering a set of excellent functionalities for different machine learning tasks ranging from regression, classification, and dimension reduction to clustering and rule extraction. In this contribution, we explore, from the computational perspective, the expanding body of Apache Spark MLlib 2.0 as an open-source, distributed, scalable, and platform-independent machine learning library. Specifically, we perform several real-world machine learning experiments to examine the qualitative and quantitative attributes of the platform. Furthermore, we highlight current trends in big data machine learning research and provide insights for future work.
