A Distributed Feature Selection Approach Based on a Complexity Measure

Feature selection is often required as a preliminary step in many machine learning problems. However, most existing methods work only in a centralized fashion, i.e., on the whole dataset at once. In this paper we propose a new methodology for distributing the feature selection process by samples, in a way that preserves the class distribution across partitions. A subsequent merging procedure updates the final feature subset according to the theoretical complexity of the candidate features, measured with data complexity metrics. The resulting framework for distributed feature selection is independent of the classifier and can be used with any feature selection algorithm. The effectiveness of our proposal is tested on six representative datasets. The experimental results show that execution time is considerably shortened while performance is maintained, compared both to a previous distributed approach and to the standard algorithms applied to the non-partitioned datasets.
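The pipeline described above can be sketched in three steps: a stratified split of the samples, an independent filter run on each partition, and a merge of the candidate features ranked by a data complexity measure. The sketch below is a minimal illustration, not the authors' implementation: the per-partition filter and the merge criterion both use Fisher's discriminant ratio (one of the complexity measures catalogued by Ho and Basu) as a stand-in for whichever filter and measure the method actually employs, and all function names are hypothetical.

```python
import random
from collections import defaultdict

def stratified_partitions(y, k, seed=0):
    """Split sample indices into k partitions that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    parts = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            parts[j % k].append(i)
    return parts

def fisher_ratio(X, y, feat):
    """Fisher's discriminant ratio for one feature (higher = easier to separate).
    Binary-class case for simplicity; stands in for the paper's complexity measure."""
    groups = defaultdict(list)
    for i, label in enumerate(y):
        groups[label].append(X[i][feat])
    a, b = list(groups.values())[:2]
    mean = lambda v: sum(v) / len(v)
    var = lambda v: sum((x - mean(v)) ** 2 for x in v) / len(v)
    denom = var(a) + var(b)
    return (mean(a) - mean(b)) ** 2 / denom if denom else float("inf")

def select_on_partition(X, y, idxs, m):
    """Toy per-partition filter: keep the m features with the best Fisher ratio.
    Any feature selection algorithm could be plugged in here instead."""
    Xp = [X[i] for i in idxs]
    yp = [y[i] for i in idxs]
    scores = [(fisher_ratio(Xp, yp, f), f) for f in range(len(X[0]))]
    return [f for _, f in sorted(scores, reverse=True)[:m]]

def distributed_selection(X, y, k=3, m=2):
    """Run the filter on each stratified partition, pool the candidates,
    then merge by re-ranking them with the complexity measure on the full data."""
    candidates = set()
    for part in stratified_partitions(y, k):
        candidates.update(select_on_partition(X, y, part, m))
    return sorted(candidates, key=lambda f: -fisher_ratio(X, y, f))[:m]
```

Because each partition keeps the original class proportions, every filter run sees a representative sub-problem, which is what lets the cheap per-partition runs replace a single pass over the whole dataset.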
