Dimensionality Reduction for Big Data

In the era of Big Data, an exponential increase in data volume is usually accompanied by an explosion in the number of features. Dimensionality reduction arises as a possible solution to enable large-scale learning with millions of dimensions. Nevertheless, like any other family of algorithms, reduction methods require a redesign so that they can operate at such magnitudes. In particular, they must be prepared to tackle the explosive combinatorial effects of the "curse of Big Dimensionality" while embracing the benefits of the "blessing side of dimensionality" (poorly correlated features). In this chapter we analyze the problems and benefits derived from the "curse of Big Dimensionality", and how this problem has spread across many fields, such as the life sciences and the Internet. We then survey the contributions that address the large-scale dimensionality reduction problem. Next, as a case study, we examine in depth the design and behavior of one of the most popular feature selection frameworks in this field. Finally, we review the contributions related to dimensionality reduction in Big Data streams.
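The case study announced above concerns information-theoretic feature selection at scale. As a rough illustration of the underlying pattern (a hypothetical sketch, not the chapter's actual framework), the following Python snippet implements a univariate filter: each feature is scored independently by its mutual information with the class label, and only the top-k survive. All names and parameters here are illustrative.

```python
# Hypothetical sketch of a large-scale univariate filter: score each feature
# independently by mutual information with the label, then keep the k best.
# Not the chapter's actual framework; a minimal single-machine illustration.
import numpy as np

def mutual_information(x, y, bins=16):
    """Estimate I(X;Y) between a continuous feature x and discrete labels y
    by discretizing x into equal-width bins."""
    edges = np.histogram_bin_edges(x, bins=bins)[1:-1]   # interior bin edges
    x_binned = np.digitize(x, edges)                     # values in 0..bins-1
    joint = np.zeros((bins, int(y.max()) + 1))
    np.add.at(joint, (x_binned, y), 1)                   # joint histogram
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)                # marginal of X
    py = joint.sum(axis=0, keepdims=True)                # marginal of Y
    nz = joint > 0                                       # avoid log(0)
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

def select_top_k(X, y, k):
    """Univariate filter: score every feature independently, keep the k best."""
    scores = np.array([mutual_information(X[:, j], y)
                       for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]

# Toy run: 1,000 samples, 5,000 features, only the first 10 informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5000))
y = (X[:, :10].sum(axis=1) > 0).astype(int)
print(sorted(select_top_k(X, y, k=10)))
```

Because each feature is scored in isolation, the loop inside `select_top_k` is embarrassingly parallel; this is precisely the structure that the distributed MapReduce/Spark implementations surveyed in this chapter exploit by mapping the scoring step over partitions of the feature space.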
