Feature Selection Applied to Microarray Data.

Microarray data typically have a very large number of features (on the order of thousands), while the number of examples is usually fewer than 100. In microarray classification, this imbalance poses a challenge for machine learning methods, which can overfit and thus suffer degraded performance. A common solution is to apply a dimensionality reduction technique before classification to reduce the number of features. This chapter focuses on one of the best-known dimensionality reduction techniques: feature selection. We will see how feature selection can help improve classification accuracy in several microarray data scenarios.
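
As a rough illustration of selecting features before classification, the sketch below ranks features with a simple signal-to-noise filter score, keeps the top k, and then trains a nearest-centroid classifier on those features only. Everything here is an assumption made for illustration and is not taken from the chapter: the synthetic data generator, the scoring function, the classifier, and all names and parameter values (`make_data`, `filter_scores`, `select_top_k`, 2000 features, 20 examples per class, k = 5).

```python
import math
import random


def make_data(n_per_class, n_features, n_informative, shift, rng):
    """Synthetic two-class data (hypothetical): only the first
    n_informative features differ between the classes."""
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            if label == 1:
                for j in range(n_informative):
                    row[j] += shift
            X.append(row)
            y.append(label)
    return X, y


def mean_std(values):
    """Sample mean and sample standard deviation of a list."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / (len(values) - 1)
    return m, math.sqrt(var)


def filter_scores(X, y):
    """Per-feature score |mean0 - mean1| / (std0 + std1)."""
    scores = []
    for j in range(len(X[0])):
        col0 = [row[j] for row, lab in zip(X, y) if lab == 0]
        col1 = [row[j] for row, lab in zip(X, y) if lab == 1]
        m0, s0 = mean_std(col0)
        m1, s1 = mean_std(col1)
        scores.append(abs(m0 - m1) / (s0 + s1 + 1e-12))
    return scores


def select_top_k(X, y, k):
    """Indices of the k highest-scoring features."""
    scores = filter_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


def centroids(X, y, features):
    """Class centroids restricted to the selected features."""
    cents = {}
    for label in set(y):
        rows = [row for row, lab in zip(X, y) if lab == label]
        cents[label] = [sum(r[j] for r in rows) / len(rows) for j in features]
    return cents


def predict(row, cents, features):
    """Label of the nearest centroid (squared Euclidean distance)."""
    def dist(c):
        return sum((row[j] - cj) ** 2 for j, cj in zip(features, c))
    return min(cents, key=lambda label: dist(cents[label]))


def accuracy(X, y, cents, features):
    hits = sum(predict(row, cents, features) == lab for row, lab in zip(X, y))
    return hits / len(y)


rng = random.Random(0)
# Many features, few examples: 2000 features, 40 training examples.
X_train, y_train = make_data(20, 2000, 5, 4.0, rng)
X_test, y_test = make_data(20, 2000, 5, 4.0, rng)

# Select features on the training data only, then classify.
selected = select_top_k(X_train, y_train, k=5)
model = centroids(X_train, y_train, selected)
print("selected features:", sorted(selected))
print("test accuracy:", accuracy(X_test, y_test, model, selected))
```

The scores are computed on the training examples only, so the held-out examples play no part in choosing the features. With real microarray data the class separation is usually much weaker than in this synthetic setup, so the choice of score and of k matters far more than it does here.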
