Data intensive parallel feature selection method study

Feature selection is an important research topic in machine learning and pattern recognition. It reduces dimensionality, removes irrelevant data, increases learning accuracy, and makes results easier to interpret. As computing has advanced, many application fields now face a data deluge. Classical feature selection methods become impractical on large-scale datasets because of their high computational cost. This paper studies a data-intensive parallel feature selection method based on the MapReduce programming model. In each map node, a novel method is used to calculate the mutual information, and the combinatory contribution degree is used to determine the number of selected features. In each epoch, the features selected by all map nodes are collected at a reduce node, which synthesizes them to select a single feature. The parallel feature selection method is scalable. Its efficiency is illustrated through an example analysis.
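
The map/reduce loop described above can be sketched roughly as follows. This is a minimal, single-process illustration under several assumptions, not the paper's implementation: the abstract does not specify the novel mutual information calculation or the combinatory contribution degree, so the sketch substitutes a plain plug-in estimate of mutual information for discrete features, lets each map node propose only one feature per epoch, and uses a majority vote as the reduce-side synthesis. The function names (`mutual_information`, `map_select`, `reduce_select`, `parallel_select`) are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in nats for two discrete sequences."""
    n = len(xs)
    cx, cy, cxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Sum p(x,y) * log(p(x,y) / (p(x) p(y))) over observed pairs.
    return sum((c / n) * log(c * n / (cx[x] * cy[y])) for (x, y), c in cxy.items())


def map_select(rows, labels, selected):
    """Map step (assumed form): propose the unselected feature with the
    highest mutual information with the class label on this partition."""
    candidates = [j for j in range(len(rows[0])) if j not in selected]
    return max(
        candidates,
        key=lambda j: mutual_information([r[j] for r in rows], labels),
    )


def reduce_select(votes):
    """Reduce step (assumed form): majority vote over the map proposals,
    breaking ties by the lowest feature index."""
    counts = Counter(votes)
    return min(counts, key=lambda j: (-counts[j], j))


def parallel_select(partitions, k):
    """Run k epochs; each epoch adds one feature chosen by the reduce step."""
    selected = []
    for _ in range(k):
        # In a real MapReduce job these calls would run on separate map nodes.
        votes = [map_select(rows, labels, selected) for rows, labels in partitions]
        selected.append(reduce_select(votes))
    return selected


# Two toy partitions: feature 0 copies the label, features 1 and 2 carry no information.
partitions = [
    ([(0, 0, 1), (1, 0, 0), (0, 0, 0), (1, 0, 1)], [0, 1, 0, 1]),
    ([(1, 0, 1), (0, 0, 1), (1, 0, 0), (0, 0, 0)], [1, 0, 1, 0]),
]
first_feature = parallel_select(partitions, 1)
```

Because each epoch only needs the map nodes to send back feature indices rather than data, the communication cost per epoch stays small regardless of partition size, which is one plausible reading of why the method is described as scalable.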
