On the Use of Random Discretization and Dimensionality Reduction in Ensembles for Big Data

Massive data growth in recent years has made data reduction techniques especially popular because of their ability to shrink these enormous volumes of data, commonly called Big Data. Random Projection Random Discretization is an innovative ensemble method that uses two data reduction techniques to create more informative data: Random Discretization, proposed by its authors, and Random Projections (RP). However, RP has some shortcomings that can be solved by more powerful methods such as Principal Components Analysis (PCA). To tackle this problem, we propose a new ensemble method, named Random Discretization Dimensionality Reduction Ensemble, that uses PCA for dimensionality reduction and is implemented on the Apache Spark framework. In our experiments on five Big Data datasets, we show that our proposal achieves better prediction performance than the original algorithm and Random Forest.
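As a rough illustration of the Random Discretization step mentioned above, the following is a minimal sketch, not the authors' implementation. It assumes the common formulation in which each feature's cut points are drawn at random from its observed values; the abstract does not spell out the procedure. The function names (`random_cut_points`, `discretize`, `random_discretization`) and the parameters `n_bins` and `seed` are hypothetical, and the sketch omits the dimensionality reduction (RP or PCA) and Apache Spark parts entirely.

```python
import bisect
import random


def random_cut_points(column, n_bins, rng):
    """Pick n_bins - 1 cut points by sampling observed values at random.

    Assumption: cut points come from the data itself, without replacement.
    """
    return sorted(rng.sample(column, n_bins - 1))


def discretize(value, cuts):
    """Return the index of the interval (0 .. len(cuts)) containing value."""
    return bisect.bisect_right(cuts, value)


def random_discretization(rows, n_bins, seed):
    """Discretize every feature of `rows` with its own random cut points.

    Each ensemble member would call this with a different seed, so that
    every base classifier sees a differently discretized view of the data.
    """
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*rows)]
    cuts_per_feature = [random_cut_points(col, n_bins, rng) for col in columns]
    binned = [
        [discretize(v, cuts) for v, cuts in zip(row, cuts_per_feature)]
        for row in rows
    ]
    return binned, cuts_per_feature


# Toy data: six instances with two numeric features (made up for illustration).
rows = [
    [0.1, 10.0],
    [0.4, 12.5],
    [0.5, 9.0],
    [0.9, 30.0],
    [1.3, 22.0],
    [2.0, 15.0],
]
binned, cuts = random_discretization(rows, n_bins=3, seed=0)
```

Calling `random_discretization` with different `seed` values gives different cut points per feature, which is the source of diversity among the ensemble members in this sketch.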
