DPASF: a Flink library for streaming data preprocessing

Background: Data preprocessing techniques are devoted to correcting or alleviating errors in data. Discretization and feature selection are two of the most widespread data preprocessing techniques. Although there are many proposals for static Big Data preprocessing, there is little research devoted to the continuous Big Data problem. Apache Flink is a recent Big Data framework, following the MapReduce paradigm, focused on distributed stream and batch data processing. In this paper, we propose a data stream library for Big Data preprocessing, named DPASF, under Apache Flink. The library is composed of six of the most popular and widely used data preprocessing algorithms: three for discretization and three for feature selection.

Results: The algorithms have been tested using two Big Data datasets. Experimental results show that preprocessing can not only reduce the size of the data, but also maintain or even improve the original accuracy in a short period of time.

Conclusion: DPASF contains algorithms that are useful when dealing with Big Data streams. The preprocessing algorithms included in the library are able to tackle big datasets efficiently and to correct imperfections in the data.
