Evolutionary Feature Selection for Big Data Classification: A MapReduce Approach

Many disciplines now have to deal with big datasets that also involve a large number of features. Feature selection methods aim to eliminate noisy, redundant, or irrelevant features that may degrade classification performance. However, traditional methods do not scale well enough to cope with datasets of millions of instances and produce good results within a limited time. This paper presents a feature selection algorithm based on evolutionary computation that uses the MapReduce paradigm to obtain subsets of features from big datasets. The algorithm decomposes the original dataset into blocks of instances and learns from them in the map phase; the reduce phase then merges the partial results into a final vector of feature weights, which allows flexible application of the feature selection procedure by using a threshold to determine the selected subset of features. The method is evaluated using three well-known classifiers (SVM, Logistic Regression, and Naive Bayes) implemented within the Spark framework to address big data problems. In the experiments, datasets with up to 67 million instances and up to 2000 attributes were handled, showing that this is a suitable framework for evolutionary feature selection that improves both classification accuracy and runtime on big data problems.
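The map/reduce flow described above can be illustrated with a minimal single-machine sketch. It is not the paper's implementation: the abstract does not specify how partial results are encoded or merged, so the sketch assumes each block yields a binary selection mask and that the reduce phase averages those masks into weights. The function names (`split_into_blocks`, `toy_selector`, `map_phase`, `reduce_phase`, `select_features`) are hypothetical, and `toy_selector` is a simple class-mean heuristic standing in for the per-block evolutionary search.

```python
from functools import reduce
from typing import Callable, List, Sequence, Tuple

Instance = Tuple[Sequence[float], int]  # (feature values, binary class label)


def split_into_blocks(data: List[Instance], n_blocks: int) -> List[List[Instance]]:
    """Partition the instances into at most n_blocks contiguous, disjoint blocks."""
    size = -(-len(data) // n_blocks)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def toy_selector(block: List[Instance]) -> List[int]:
    """Hypothetical stand-in for the per-block evolutionary search.

    Marks a feature as selected (1) when the class means differ by more than 0.5.
    """
    n_features = len(block[0][0])
    mask = []
    for j in range(n_features):
        pos = [x[j] for x, y in block if y == 1]
        neg = [x[j] for x, y in block if y == 0]
        if not pos or not neg:
            mask.append(0)  # cannot compare classes within this block
            continue
        gap = abs(sum(pos) / len(pos) - sum(neg) / len(neg))
        mask.append(1 if gap > 0.5 else 0)
    return mask


def map_phase(blocks: List[List[Instance]],
              selector: Callable[[List[Instance]], List[int]]) -> List[List[int]]:
    """Map: run the selector independently on every block."""
    return [selector(block) for block in blocks]


def reduce_phase(masks: List[List[int]]) -> List[float]:
    """Reduce (assumed scheme): average the binary masks into feature weights."""
    summed = reduce(lambda a, b: [u + v for u, v in zip(a, b)], masks)
    return [s / len(masks) for s in summed]


def select_features(weights: List[float], threshold: float) -> List[int]:
    """Keep the indices of features whose weight reaches the threshold."""
    return [j for j, w in enumerate(weights) if w >= threshold]


if __name__ == "__main__":
    # Feature 0 separates the classes; feature 1 is constant.
    data = [([1.0, 0.3], 1), ([0.0, 0.3], 0)] * 4
    weights = reduce_phase(map_phase(split_into_blocks(data, 2), toy_selector))
    print(weights, select_features(weights, 0.5))
```

Because the threshold is applied only to the merged weights, it can be varied after the expensive map phase has finished, without rerunning the per-block search.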
