A Parallel Implementation of Information Gain Using Hive in Conjunction with MapReduce for Continuous Features

Finding efficient ways to perform the Information Gain algorithm is becoming ever more important as we enter the Big Data era, where data volume and dimensionality are increasing at alarming rates. When machine learning algorithms are burdened with high-dimensional data containing redundant features, information gain becomes crucial for feature selection. Information gain is also often used as a precursor step in building decision trees, text classifiers, support vector machines, etc. Due to the very large volume of today's data, there is a need to efficiently parallelize classic algorithms like Information Gain. In this paper, we present a parallel implementation of Information Gain for continuous features in the MapReduce environment, using MapReduce in conjunction with Hive. In our approach, Hive was used to calculate the counts and the parent entropy, and a map-only job was used to complete the Information Gain calculations. Our approach demonstrated gains in run time by carefully designing MapReduce jobs that efficiently leverage the Hadoop cluster.
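The per-split quantity that the approach above parallelizes (parent entropy minus the weighted entropy of the child partitions) can be sketched serially in Python. This is an illustrative sketch of the standard information gain formula for a binary threshold split on a continuous feature, not the paper's Hive/MapReduce implementation; the function names, data, and threshold are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Information gain of splitting a continuous feature at `threshold`:
    parent entropy minus the size-weighted entropy of the two partitions."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted_child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted_child

# Hypothetical example: a feature that separates two classes perfectly at 5.0
values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = ['a', 'a', 'a', 'b', 'b', 'b']
print(information_gain(values, labels, 5.0))  # perfect split: gain = parent entropy = 1.0
```

In the parallel setting described in the abstract, the class counts per partition and the parent entropy would come from Hive aggregation queries, leaving only the final weighted-entropy arithmetic to the map-only job.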
