Distributed Entropy Minimization Discretizer for Big Data Analysis under Apache Spark

The astonishing rate of data generation on the Internet has rendered many classical knowledge extraction techniques obsolete. Data reduction techniques are required to lower the computational complexity of these techniques. Among them, discretization is one of the most important tasks in the data mining process, aimed at simplifying and reducing continuous-valued data in large datasets. Despite the great interest in this reduction mechanism, only a few simple discretization techniques have been implemented in the literature for Big Data. We therefore propose a distributed implementation of the entropy minimization discretizer proposed by Fayyad and Irani, using the Apache Spark platform. Our solution goes beyond a simple parallelization, transforming the iterative procedure of the original proposal into a single-step computation. Experimental results on two large-scale datasets show that our solution improves classification accuracy and accelerates the underlying learning process.
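As a rough illustration of the idea behind entropy minimization discretization, the following single-machine Python sketch selects one binary cut point for a continuous attribute by minimizing the weighted class entropy of the two resulting intervals. It is only a sketch under stated assumptions: the function names `class_entropy` and `best_cut_point` are hypothetical, the recursive partitioning and the stopping criterion of the original method are omitted, and nothing here reflects the distributed Spark implementation proposed in the paper.

```python
import math
from collections import Counter


def class_entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def best_cut_point(values, labels):
    """Return (cut, weighted_entropy) for the best single binary split.

    Candidate cuts are midpoints between consecutive distinct sorted values.
    Returns None when all values are identical (no valid cut exists).
    Hypothetical helper for illustration only.
    """
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    best = None
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no boundary between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        # Class entropy of each side, weighted by the fraction of points it holds
        e = (len(left) / n) * class_entropy(left) \
            + (len(right) / n) * class_entropy(right)
        if best is None or e < best[1]:
            best = ((lo + hi) / 2, e)
    return best


if __name__ == "__main__":
    values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
    labels = ["a", "a", "a", "b", "b", "b"]
    print(best_cut_point(values, labels))
```

This naive version recomputes the entropies for every candidate, which is quadratic in the number of points; it is meant to show the selection criterion, not an efficient or scalable evaluation strategy.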

[1] Patrick Wendell, et al. Learning Spark: Lightning-Fast Big Data Analytics, 2015.

[2] Michael Minelli, et al. Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses, 2012.

[3] Jimmy J. Lin. MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That's Not a Nail!, 2012, Big Data.

[4] Francisco Herrera, et al. Data Preprocessing in Data Mining, 2014, Intelligent Systems Reference Library.

[5] Sanjay Ghemawat, et al. MapReduce: Simplified Data Processing on Large Clusters, 2004, OSDI.

[6] Usama M. Fayyad, et al. Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning, 1993, IJCAI.

[7] Yen-Liang Chen, et al. A Dynamic Discretization Approach for Constructing Decision Trees with a Continuous Label, 2009, IEEE Transactions on Knowledge and Data Engineering.

[8] Huan Liu, et al. Discretization: An Enabling Technique, 2002, Data Mining and Knowledge Discovery.

[9] Yen-Liang Chen, et al. A Novel Decision-Tree Method for Structured Continuous-Label Classification, 2013, IEEE Transactions on Cybernetics.

[10] Xindong Wu, et al. Discretization Methods, 2010, Data Mining and Knowledge Discovery Handbook.

[11] Francisco Herrera, et al. A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning, 2013, IEEE Transactions on Knowledge and Data Engineering.

[12] Tapio Elomaa, et al. General and Efficient Multisplitting of Numerical Attributes, 1999, Machine Learning.

[13] María José del Jesús, et al. Big Data with Cloud Computing: an insight on the computing environment, MapReduce, and programming frameworks, 2014, WIREs Data Mining Knowl. Discov.

[14] Chih-Jen Lin, et al. LIBSVM: A library for support vector machines, 2011, TIST.

[15] Claude E. Shannon, et al. The mathematical theory of communication, 1950.

[16] Sang Joon Kim, et al. A Mathematical Theory of Communication, 2006.

[17] Francisco Herrera, et al. On the use of MapReduce for imbalanced big data using Random Forest, 2014, Inf. Sci.

[18] J. Ross Quinlan, et al. Induction of Decision Trees, 1986, Machine Learning.

[19] Bogdan S. Chlebus. On the Klee's Measure Problem in Small Dimensions, 1998, SOFSEM.

[20] Richard O. Duda, et al. Pattern classification and scene analysis, 1974, A Wiley-Interscience publication.