Feature selection based on community detection in feature correlation networks

Feature selection is an important data preprocessing step in data mining and machine learning tasks, especially for high-dimensional data. In this paper, we propose a novel feature selection method based on feature correlation networks, i.e., complex weighted networks describing the strongest correlations among features in a dataset. The method uses community detection techniques to identify cohesive groups of features in feature correlation networks. A subset of features exhibiting a strong association with the class variable is then selected according to the identified community structure, taking into account the size of feature communities and the connections within them. The proposed method is experimentally evaluated on a high-dimensional dataset containing signaling protein features related to the diagnosis of Alzheimer’s disease. We compare the performance of seven commonly used classifiers trained under three conditions: without feature selection, after feature selection by four variants of our method (each determined by a different community detection technique), and after feature selection by four widely used state-of-the-art feature selection methods available in the WEKA machine learning library. The results indicate that our method improves the classification accuracy of several classification models while greatly reducing the dimensionality of the dataset. Additionally, our method tends to outperform the traditional feature selection methods provided by the WEKA library.
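
The general idea described above can be illustrated with a minimal sketch. It is not the method evaluated in the paper: it assumes Pearson correlation as the association measure, uses a fixed threshold (a hypothetical `threshold` parameter) to keep only the strongest correlations, substitutes connected components for a real community detection technique, and keeps a single feature per group, namely the one most correlated with the class. All function names and the toy data are illustrative assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def correlation_network(features, threshold):
    """Weighted graph with an edge between two features when |r| >= threshold."""
    graph = {name: {} for name in features}
    for a, b in combinations(features, 2):
        weight = abs(pearson(features[a], features[b]))
        if weight >= threshold:
            graph[a][b] = weight
            graph[b][a] = weight
    return graph


def connected_components(graph):
    """Assumed stand-in for community detection: groups of reachable nodes."""
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, group = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            group.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        groups.append(group)
    return groups


def select_features(features, labels, threshold=0.8):
    """Keep, for each group, the feature most correlated with the class labels."""
    graph = correlation_network(features, threshold)
    selected = []
    for group in connected_components(graph):
        best = max(group, key=lambda f: abs(pearson(features[f], labels)))
        selected.append(best)
    return sorted(selected)


# Toy data (hypothetical): f1 and f2 are nearly redundant, f3 is less related.
toy_features = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "f2": [1.1, 2.0, 2.9, 4.2, 5.0, 6.1],
    "f3": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],
}
toy_labels = [0, 0, 0, 1, 1, 1]
selected = select_features(toy_features, toy_labels)
```

On this toy input, `f1` and `f2` fall into one group and `f3` forms its own, so one feature from each group is kept. A closer approximation of the approach would replace `connected_components` with a proper community detection algorithm and let the number of features kept per community depend on its size and internal connections.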
