A Two-Stage Big Data Analytics Framework with Real World Applications Using Spark Machine Learning and Long Short-Term Memory Network

Every day we experience unprecedented data growth from numerous sources, which contribute to big data in terms of volume, velocity, and variability. These datasets in turn pose great challenges to analytics frameworks and computational resources, making it difficult to extract meaningful information in a timely manner. Developing an efficient big data analytics framework that meets these challenges is therefore an important research topic. To address them by exploiting non-linear relationships in very large, high-dimensional datasets, machine learning (ML) and deep learning (DL) algorithms are increasingly used in analytics frameworks. Apache Spark is widely used as a fast big data processing engine, and its distributed ML library, Spark MLlib, helps to solve iterative ML tasks. For real-world research problems, DL architectures such as Long Short-Term Memory (LSTM) are an effective way to overcome practical issues in conventional deep architectures, such as reduced accuracy, long-term sequence dependency, and vanishing and exploding gradients. In this paper, we propose an efficient analytics framework: a progressive machine learning technique that combines Spark-based linear models, a Multilayer Perceptron (MLP), and LSTM in a two-stage cascade structure in order to enhance predictive accuracy. The proposed architecture enables us to organize big data analytics in a scalable and efficient way. To show the effectiveness of our framework, we applied the cascading structure to two different real-life datasets, solving a multiclass and a binary classification problem, respectively. Experimental results show that our analytical framework outperforms state-of-the-art approaches, achieving a high level of classification accuracy.
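The abstract does not say how the two stages are connected, so the following is only a minimal sketch of one common cascade pattern, under stated assumptions: the first-stage model's predicted probability is appended to the input features of the second-stage model. A small pure-Python logistic regression stands in for both the Spark-based linear model and the MLP/LSTM stage, and all names (`train_logistic`, `fit_cascade`, `make_toy_data`) as well as the toy data are hypothetical, not taken from the paper.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(rows, labels, epochs=200, lr=0.5):
    """Stochastic gradient descent for binary logistic regression.

    Assumption: a stand-in for any stage model (Spark linear model, MLP, LSTM).
    """
    w = [0.0] * len(rows[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def predict_proba(model, x):
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def fit_cascade(rows, labels):
    """Stage 1 is trained on raw features; stage 2 sees raw features plus
    the stage-1 probability (assumed wiring, not confirmed by the paper)."""
    stage1 = train_logistic(rows, labels)
    augmented = [x + [predict_proba(stage1, x)] for x in rows]
    stage2 = train_logistic(augmented, labels)
    return stage1, stage2


def cascade_predict(cascade, x):
    stage1, stage2 = cascade
    p1 = predict_proba(stage1, x)
    return int(predict_proba(stage2, x + [p1]) >= 0.5)


def make_toy_data(n_per_class=20, seed=0):
    """Two well-separated 2-D clusters (hypothetical data for illustration)."""
    rng = random.Random(seed)
    rows, labels = [], []
    for label, center in ((0, 0.0), (1, 2.0)):
        for _ in range(n_per_class):
            rows.append([center + rng.uniform(-0.5, 0.5),
                         center + rng.uniform(-0.5, 0.5)])
            labels.append(label)
    return rows, labels


if __name__ == "__main__":
    rows, labels = make_toy_data()
    cascade = fit_cascade(rows, labels)
    correct = sum(cascade_predict(cascade, x) == y for x, y in zip(rows, labels))
    print("training accuracy:", correct / len(rows))
```

In a Spark setting, the stage-1 step would presumably be replaced by an MLlib estimator and the stage-2 step by an MLP or LSTM; the sketch only illustrates how predictions can flow from one stage to the next.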
