Scalable Feature Selection via Distributed Diversity Maximization

Feature selection is a fundamental problem in machine learning and data mining. Most feature selection algorithms are designed to run on a single machine (the centralized setting) and scale poorly to very large datasets. Although some distributed methods exist, most of them partition the data horizontally, which makes them ill-suited to datasets with many features and few instances. In this paper, we therefore introduce a novel vertically distributable feature selection method that speeds up the selection process and handles very large datasets in a scalable manner. In general, feature selection methods aim to select features that are relevant and non-redundant (minimum redundancy and maximum relevance). Accounting for redundancy is much harder in a vertically distributed setting than in a centralized one, since no single machine has global access to the whole dataset. To the best of our knowledge, this is the first vertically distributed filter method that handles redundancy while achieving results consistently comparable to those of centralized methods. We formalize feature selection as a diversity maximization problem by introducing a mutual-information-based metric distance on the features. We demonstrate the effectiveness of our method through an extensive empirical study; in particular, our distributed method outperforms state-of-the-art centralized feature selection algorithms on a variety of datasets. From a theoretical point of view, we prove that the greedy algorithm used in our method achieves, with high probability, an approximation factor of 1/4 for diversity maximization in the distributed setting, and we improve this to an expected approximation factor of 8/25 by using multiplicity in the distribution of features.
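To make the formulation concrete, below is a minimal centralized sketch in Python of the two ingredients the abstract names: a mutual-information-based metric distance between features, and a greedy algorithm for (max-sum) diversity maximization. The specific distance shown, the normalized variation of information d(X, Y) = 1 - I(X; Y) / H(X, Y), is one well-known mutual-information-based metric and is an assumption here, as are all function names; this is not the paper's implementation. The paper's distributed variant would additionally partition the features randomly across machines, run the greedy step locally on each partition, and run it once more on the union of the local solutions (a composable core-set scheme).

```python
import numpy as np
from sklearn.metrics import mutual_info_score


def mi_metric_distance(x, y):
    """Normalized variation of information: d(X, Y) = 1 - I(X; Y) / H(X, Y).

    A mutual-information-based distance that is a metric on discrete
    (e.g. pre-discretized) features; the exact distance used in the
    paper may differ.
    """
    mi = mutual_info_score(x, y)
    hx = mutual_info_score(x, x)   # H(X) = I(X; X) for discrete X
    hy = mutual_info_score(y, y)   # H(Y) = I(Y; Y)
    h_joint = hx + hy - mi         # H(X, Y) = H(X) + H(Y) - I(X; Y)
    return 1.0 - mi / h_joint if h_joint > 0 else 0.0


def greedy_diversity_maximization(X, k):
    """Select k feature indices by greedy max-sum diversification.

    X is an (n_samples, n_features) array of discretized features;
    assumes 2 <= k <= n_features. Starting from the farthest pair,
    each step adds the feature whose summed distance to the already
    selected set is largest.
    """
    n_features = X.shape[1]
    # Pairwise distance matrix over features (an O(n^2) illustration;
    # a distributed implementation would never materialize this globally).
    D = np.zeros((n_features, n_features))
    for a in range(n_features):
        for b in range(a + 1, n_features):
            D[a, b] = D[b, a] = mi_metric_distance(X[:, a], X[:, b])

    # Seed with the farthest pair of features.
    i, j = np.unravel_index(np.argmax(D), D.shape)
    selected = [int(i), int(j)]
    while len(selected) < k:
        remaining = [c for c in range(n_features) if c not in selected]
        best = max(remaining, key=lambda c: D[c, selected].sum())
        selected.append(best)
    return selected
```

On a small dataset this could be invoked as, say, `greedy_diversity_maximization(X_discrete, k=20)`; in the distributed setting, each machine would run the same greedy routine on its random share of the feature columns before a final round over the union of the local outputs.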
