Toward parallel feature selection from vertically partitioned data

Feature selection is often required as a preliminary step in many pattern recognition problems. In recent years, parallel learning has attracted considerable attention as high-dimensional datasets have become common. Still, most existing algorithms work only in a centralized manner, i.e., they use the whole dataset at once. This paper proposes a parallel filter approach for vertically partitioned data. The idea is to split the data by features and then apply a filter to each partition, repeating the process over several rounds to obtain a stable set of features. A merging procedure then combines the partial results into a single subset of relevant features. Experiments on three representative datasets show that execution time is considerably shortened, while performance is maintained or even improved compared to the standard algorithms applied to the non-partitioned datasets. The proposed approach can be used with any filter algorithm, so it can be seen as a general framework for parallel feature selection.
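The split, filter, and merge steps described above can be illustrated with a minimal sketch. It is not the paper's implementation: the filter (`correlation_filter`, ranking by absolute Pearson correlation with the class), the vote-based merge, and every parameter name (`n_partitions`, `n_rounds`, `keep_ratio`, `min_votes`) are assumptions made only for illustration. Any filter that returns indices of retained features could be substituted.

```python
import numpy as np

def correlation_filter(X, y, keep):
    """Hypothetical filter: keep the `keep` features with the largest
    absolute Pearson correlation with the class vector y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    denom[denom == 0] = np.inf  # constant features receive a score of 0
    scores = np.abs(Xc.T @ yc) / denom
    return np.argsort(scores)[::-1][:keep]

def partitioned_selection(X, y, n_partitions=4, n_rounds=5,
                          keep_ratio=0.25, min_votes=3, seed=0):
    """Vertically partition the features, filter each partition,
    repeat for several rounds, and merge the results by voting
    (the voting rule is an assumed merging procedure)."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    votes = np.zeros(n_features, dtype=int)
    for _ in range(n_rounds):
        # A fresh random split of the feature indices in every round.
        perm = rng.permutation(n_features)
        # Each block only needs its own columns, so the iterations of
        # this inner loop are independent and could run in parallel.
        for block in np.array_split(perm, n_partitions):
            keep = max(1, int(round(keep_ratio * len(block))))
            local = correlation_filter(X[:, block], y, keep)
            votes[block[local]] += 1
    # Merge: retain features selected in at least `min_votes` rounds.
    return np.flatnonzero(votes >= min_votes)

# Usage on synthetic data: features 0 and 1 carry the class signal.
rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=200).astype(float)
X = rng.normal(size=(200, 40))
X[:, 0] = y + 0.1 * rng.normal(size=200)
X[:, 1] = -y + 0.1 * rng.normal(size=200)
selected = partitioned_selection(X, y)
print(selected)
```

Redrawing the partitions in each round keeps a feature from being judged against a single fixed set of competitors, which is one plausible way to obtain the stable feature set the abstract mentions.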
