Feature Selection: Filter Methods Performance Challenges

Learning is the heart of intelligence. Machine learning focuses on automating methods that achieve objectives, improve predictions, or inform behavior. Feature selection is a vital step in data analysis: it reduces dataset dimensionality by eliminating irrelevant and/or redundant attributes, thereby simplifying the learning process and improving the quality of outcomes. This research critically analyses filter methods based on ranking procedures (Information Gain (IG), Chi-square (CHI), V-score, Fisher Score, mRMR, Va and ReliefF) and identifies the challenges that arise. In particular, we examine how threshold determination affects the results of filter methods that rank features by score. We show that this issue is critical, especially in the era of big data, in which users deal with attributes in the magnitude of tens of thousands yet only a limited number of instances.
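The threshold problem described above can be illustrated concretely. The sketch below (a minimal example, not from the paper, assuming scikit-learn and NumPy) ranks features by two common filter scores, Information Gain (approximated by mutual information) and Chi-square, then applies two plausible cutoff rules; the selected subsets differ depending on both the method and the threshold chosen.

```python
# Minimal sketch (hypothetical example): how the cutoff applied to ranked
# filter scores changes the selected feature subset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif, chi2

# Synthetic high-dimensional data: many features, few informative ones.
X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=5, random_state=0)

# Information Gain proxy: mutual information between each feature and the class.
ig_scores = mutual_info_classif(X, y, random_state=0)

# Chi-square requires non-negative inputs, so shift each feature to start at 0.
chi_scores, _ = chi2(X - X.min(axis=0), y)

# Rank features by score, highest first, for each method.
ig_rank = np.argsort(ig_scores)[::-1]
chi_rank = np.argsort(chi_scores)[::-1]

# Threshold rule 1: keep the top-k features. Different k, different subset,
# and the two methods rarely agree on exactly which features to keep.
for k in (5, 10, 20):
    overlap = len(set(ig_rank[:k]) & set(chi_rank[:k]))
    print(f"top-{k}: IG/CHI subset overlap = {overlap} features")

# Threshold rule 2: keep features whose score exceeds the mean score.
mean_cut = ig_scores > ig_scores.mean()
print("mean-score rule keeps", mean_cut.sum(), "features under IG")
```

The point of the sketch is that neither the ranking method alone nor a fixed rule such as "top-k" or "above the mean" determines the subset; the user-chosen threshold does, which is exactly the challenge the paper highlights for high-dimensional, low-sample settings.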
