Improvement in Hadoop performance using integrated feature extraction and machine learning algorithms

Big Data refers to datasets so large and complex that traditional data processing technologies are inadequate for them. Big Data has the potential to transform many aspects of society, but collecting and managing such data is challenging and complex. Hadoop was designed to process large amounts of unstructured and complex data. It provides massive data storage together with the ability to handle a virtually unlimited number of concurrent tasks or jobs. Feature selection is a powerful technique for dimensionality reduction and an important step in machine learning applications. In recent decades, data has grown steadily in both the number of instances and the number of features, which makes feature selection increasingly difficult. Coping with this era of Big Data requires new techniques that address the problem more efficiently, since the algorithms currently in use may no longer be suitable, especially when the data exceeds hundreds of gigabytes. In this work, correlation-based feature selection and mutual information-based feature selection were used to improve performance. AdaBoost and support vector machine classifiers were then used to improve classification accuracy. The experimental results show that the proposed method achieved better performance than the other methods.
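
As a rough illustration of the mutual information-based feature selection mentioned above, the following minimal sketch scores each discrete feature by its estimated mutual information with the class label and ranks the features by that score. It is not the implementation used in this work: the function names (mutual_information, rank_features), the toy data, and the single-machine, pure-Python setting are all assumptions made for illustration, and a Hadoop deployment would compute the same counts in a distributed fashion.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Estimate I(X; Y) in bits from two equal-length sequences of discrete values."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    count_x = Counter(xs)            # marginal counts of X
    count_y = Counter(ys)            # marginal counts of Y
    count_xy = Counter(zip(xs, ys))  # joint counts of (X, Y)
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log2(p(x, y) / (p(x) * p(y))), written in terms of counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


def rank_features(rows, labels):
    """Return (feature_index, score) pairs sorted by descending mutual information."""
    n_features = len(rows[0])
    scores = [
        (j, mutual_information([row[j] for row in rows], labels))
        for j in range(n_features)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: feature 0 copies the label, features 1 and 2 are independent of it.
rows = [
    (0, 0, 1),
    (0, 1, 0),
    (1, 0, 1),
    (1, 1, 0),
]
labels = [0, 0, 1, 1]

ranking = rank_features(rows, labels)
for index, score in ranking:
    print(f"feature {index}: {score:.3f} bits")
```

In this sketch, the top-ranked features would be kept and passed to a classifier such as AdaBoost or a support vector machine. Continuous features would first need to be discretized, which is a further assumption of this example.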
