Mutually-exclusive-and-collectively-exhaustive feature selection scheme

Abstract: In the fields of machine learning and data mining, feature selection methods are used to identify the most cost-effective predictors and to deepen the understanding of pattern recognition and extraction. This study proposes a novel mutually-exclusive-and-collectively-exhaustive (MECE) feature selection scheme. Grounded in the MECE principle from decision science, the scheme proceeds in three stages: evaluation of independence, evaluation of importance, and evaluation of completeness. It aims to identify independent, important variables that together carry complete information. A case study of fault classification in semiconductor manufacturing and a study of breast cancer relapse identification in bioinformatics are used to validate the proposed scheme. The results demonstrate that the MECE scheme selects fewer variables, avoids the multicollinearity problem, and improves classification accuracy in both case studies.
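The abstract does not spell out the criteria applied at each stage, so the following is a minimal sketch of a three-stage MECE-style pipeline under stated assumptions: pairwise-correlation pruning for the independence stage, a random-forest importance ranking for the importance stage, and a cumulative-importance coverage threshold for the completeness stage. The function name mece_select, the thresholds, and the use of scikit-learn's RandomForestClassifier are illustrative choices, not the authors' procedure.

```python
# Illustrative sketch of a three-stage MECE-style feature selection pipeline.
# The specific criteria (correlation pruning, random-forest importance,
# cumulative-importance coverage) are assumptions, not the paper's method.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def mece_select(X: pd.DataFrame, y: np.ndarray,
                corr_threshold: float = 0.9,
                coverage: float = 0.95) -> list:
    # Stage 1 (independence): greedily drop one feature from each highly
    # correlated pair so the retained features are roughly non-redundant.
    corr = X.corr().abs()
    keep = []
    for col in X.columns:
        if all(corr.loc[col, k] < corr_threshold for k in keep):
            keep.append(col)

    # Stage 2 (importance): rank the surviving features with a model-based score.
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    rf.fit(X[keep], y)
    ranked = sorted(zip(keep, rf.feature_importances_),
                    key=lambda t: t[1], reverse=True)

    # Stage 3 (completeness): keep the smallest prefix of the ranking whose
    # cumulative importance reaches the coverage target, so the selected set
    # collectively accounts for most of the predictive signal.
    selected, cum = [], 0.0
    total = sum(score for _, score in ranked)
    for name, score in ranked:
        selected.append(name)
        cum += score
        if total > 0 and cum / total >= coverage:
            break
    return selected
```

A correlation-based first stage is one simple way to address the multicollinearity issue highlighted in the abstract; the coverage threshold in the final stage then trades off parsimony against completeness of the retained information.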
