FRIEND: Feature selection on inconsistent data

Abstract With the explosive growth of information, inconsistent data are increasingly common. However, traditional feature selection methods lack efficiency because inconsistent data must be repaired beforehand. It is therefore necessary to take inconsistencies into account during feature selection, both to reduce time costs and to preserve the accuracy of machine learning models. To achieve this goal, we present FRIEND, a feature selection approach for inconsistent data. Since features that appear in consistency rules are highly correlated with one another, we aim to select a specific number of features from among them. We prove that this feature selection problem is NP-hard and develop an approximation algorithm for it. Extensive experimental results demonstrate the efficiency and effectiveness of the proposed approach.
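The selection step described in the abstract, choosing a bounded number of features from those that appear in consistency rules, can be sketched as a greedy coverage heuristic. This is an illustrative assumption, not the paper's actual algorithm: the function name `greedy_rule_coverage`, the representation of rules as feature sets, and the coverage objective are all hypothetical.

```python
def greedy_rule_coverage(rules, k):
    """Greedily select up to k features from those mentioned in
    consistency rules, maximizing marginal rule coverage.

    rules: list of sets, each set holding the features one rule refers to.
    Returns the set of selected features.
    """
    selected = set()
    # Candidate pool: only features that occur in some consistency rule,
    # mirroring the paper's restriction of selection to rule features.
    candidates = set().union(*rules) if rules else set()
    for _ in range(min(k, len(candidates))):
        best, best_gain = None, -1
        for f in candidates - selected:
            # Marginal gain: rules containing f that the current
            # selection does not yet fully cover.
            gain = sum(1 for r in rules if f in r and not r <= selected)
            if gain > best_gain:
                best, best_gain = f, gain
        selected.add(best)
    return selected
```

For example, with rules `[{"age", "zip"}, {"age", "city"}, {"salary"}]` and a budget of two features, the first pick is `age` (it appears in two uncovered rules), after which any of the remaining features ties for the second slot. A greedy scheme of this kind is the standard approximation for budgeted maximum coverage, which the NP-hardness claim in the abstract suggests is the relevant problem family.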
