Big Data Software

The advent of Big Data has created the necessity of new computing tools for processing huge amounts of data. Apache Hadoop was the first open-source framework that implemented the MapReduce paradigm. Apache Spark appeared a few years later improving the Hadoop Ecosystem. Similarly, Apache Flink appeared in the last years for tackling the Big Data streaming problem. However, as these frameworks were created for dealing with huge amounts of data, many practitioners will need machine learning algorithms for extracting the knowledge in the data. The success of a Big Data framework is going to be strongly related to its machine learning capability. This is the reason why nowadays these frameworks include a Big Data machine learning library, MLlib in the case of Spark, and FlinkML for Flink. In this chapter, we analyze in depth both MLlib and FlinkML Big Data libraries. We start with a description of Apache Spark MLlib and all of its components. We continue with a description of a Big Data library focused on data preprocessing for Apache Spark, named BigDaPSpark. Next, we provide an extensive analysis of FlinkML, and its included algorithms and utilities. Lastly, we finish with the description of a Big Data streaming library, focused on data preprocessing for Apache Flink, named BigDaPFlink.

[1]  Gilles Louppe,et al.  Independent consultant , 2013 .

[2]  Geoffrey I. Webb Contrary to Popular Belief Incremental Discretization can be Sound, Computationally Efficient and Extremely Useful for Streaming Data , 2014, 2014 IEEE International Conference on Data Mining.

[3]  Francisco Herrera,et al.  A distributed evolutionary multivariate discretizer for Big Data processing on Apache Spark , 2018, Swarm Evol. Comput..

[4]  Rong Jin,et al.  Online Feature Selection and Its Applications , 2014, IEEE Transactions on Knowledge and Data Engineering.

[5]  C. L. Philip Chen,et al.  Data-intensive applications, challenges, techniques and technologies: A survey on Big Data , 2014, Inf. Sci..

[6]  Veda C. Storey,et al.  Business Intelligence and Analytics: From Big Data to Big Impact , 2012, MIS Q..

[7]  Usama M. Fayyad,et al.  Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning , 1993, IJCAI.

[8]  Francisco Herrera,et al.  MRPR: A MapReduce solution for prototype reduction in big data classification , 2015, Neurocomputing.

[9]  Verónica Bolón-Canedo,et al.  Data discretization: taxonomy and big data challenge , 2016, WIREs Data Mining Knowl. Discov..

[10]  Ameet Talwalkar,et al.  MLlib: Machine Learning in Apache Spark , 2015, J. Mach. Learn. Res..

[11]  Sanjay Ghemawat,et al.  MapReduce: a flexible data processing tool , 2010, CACM.

[12]  Fabrizio Angiulli,et al.  Fast Nearest Neighbor Condensation for Large Data Sets Classification , 2007, IEEE Transactions on Knowledge and Data Engineering.

[13]  Francisco Herrera,et al.  Big Data: Tutorial and guidelines on information and process fusion for analytics algorithms with MapReduce , 2018, Inf. Fusion.

[14]  W. B. Roberts,et al.  Machine Learning: The High Interest Credit Card of Technical Debt , 2014 .

[15]  Francisco Herrera,et al.  A memetic algorithm for evolutionary prototype selection: A scaling up approach , 2008, Pattern Recognit..

[16]  Michael J. Franklin,et al.  Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing , 2012, NSDI.

[17]  Francisco Herrera,et al.  DPASF: a flink library for streaming data preprocessing , 2018, Big Data Analytics.

[18]  Tom White,et al.  Hadoop: The Definitive Guide , 2009 .

[19]  Cheng Soon Ong,et al.  Multivariate spearman's ρ for aggregating ranks using copulas , 2016 .

[20]  Arun Sharma,et al.  Scalable machine‐learning algorithms for big data analytics: a comprehensive review , 2016, Wiley Interdiscip. Rev. Data Min. Knowl. Discov..

[21]  Roberto Alejo,et al.  Analysis of new techniques to obtain quality training sets , 2003, Pattern Recognit. Lett..

[22]  S. García,et al.  Online entropy-based discretization for data streaming classification , 2018, Future generations computer systems.

[23]  Dennis M. Wilkinson,et al.  Large-Scale Parallel Collaborative Filtering for the Netflix Prize , 2008, AAIM.

[24]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[25]  Filiberto Pla,et al.  Prototype selection for the nearest neighbour rule through proximity graphs , 1997, Pattern Recognit. Lett..

[26]  David B. Skalak,et al.  Prototype and Feature Selection by Sampling and Random Mutation Hill Climbing Algorithms , 1994, ICML.

[27]  Francisco Herrera,et al.  Differential evolution for optimizing the positioning of prototypes in nearest neighbor classification , 2011, Pattern Recognit..

[28]  Grigorios Tsoumakas,et al.  On the Utility of Incremental Feature Selection for the Classification of Textual Data Streams , 2005, Panhellenic Conference on Informatics.

[29]  Francisco Herrera,et al.  Enabling Smart Data: Noise filtering in Big Data classification , 2017, Inf. Sci..

[30]  Jack Dongarra,et al.  MPI - The Complete Reference: Volume 1, The MPI Core , 1998 .

[31]  Forrest W. Young,et al.  Nonmetric individual differences multidimensional scaling: An alternating least squares method with optimal scaling features , 1977 .

[32]  María José del Jesús,et al.  Big Data with Cloud Computing: an insight on the computing environment, MapReduce, and programming frameworks , 2014, WIREs Data Mining Knowl. Discov..

[33]  Verónica Bolón-Canedo,et al.  An Information Theory-Based Feature Selection Framework for Big Data Under Apache Spark , 2018, IEEE Transactions on Systems, Man, and Cybernetics: Systems.

[34]  Pat Langley,et al.  Selection of Relevant Features and Examples in Machine Learning , 1997, Artif. Intell..

[35]  Mohsen Guizani,et al.  Internet of Things: A Survey on Enabling Technologies, Protocols, and Applications , 2015, IEEE Communications Surveys & Tutorials.

[36]  Masoud Nikravesh,et al.  Feature Extraction: Foundations and Applications (Studies in Fuzziness and Soft Computing) , 2006 .

[37]  Joseph K. Bradley,et al.  Spark SQL: Relational Data Processing in Spark , 2015, SIGMOD Conference.

[38]  Robert Ivor John,et al.  An Immune-Inspired Technique to Identify Heavy Goods Vehicles Incident Hot Spots , 2017, IEEE Transactions on Emerging Topics in Computational Intelligence.

[39]  Francisco Herrera,et al.  Big Data Preprocessing as the Bridge between Big Data and Smart Data: BigDaPSpark and BigDaPFlink Libraries , 2019, IoTBDS.

[40]  Francisco Herrera,et al.  Transforming big data into smart data: An insight on the use of the k‐nearest neighbors algorithm to obtain quality data , 2018, WIREs Data Mining Knowl. Discov..

[41]  Huan Liu,et al.  Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution , 2003, ICML.

[42]  Álvar Arnaiz-González,et al.  MR-DIS: democratic instance selection for big data by MapReduce , 2017, Progress in Artificial Intelligence.

[43]  Dennis L. Wilson,et al.  Asymptotic Properties of Nearest Neighbor Rules Using Edited Data , 1972, IEEE Trans. Syst. Man Cybern..

[44]  João Gama,et al.  Discretization from data streams: applications to histograms and data mining , 2006, SAC.

[45]  Francisco Herrera,et al.  Principal Components Analysis Random Discretization Ensemble for Big Data , 2018, Knowl. Based Syst..

[46]  V. Marx Biology: The big challenges of big data , 2013, Nature.

[47]  Francisco Herrera,et al.  SMOTE-BD: An Exact and Scalable Oversampling Method for Imbalanced Classification in Big Data , 2018, J. Comput. Sci. Technol..

[48]  Gustavo E. A. P. A. Batista,et al.  A study of the behavior of several methods for balancing machine learning training data , 2004, SKDD.