Feature Selection of Imbalanced Gene Expression Microarray Data

Gene expression data is a very complex data set characterised by abundant numbers of features but with a low number of observations. However, only a small number of these features are relevant to an outcome of interest. With this kind of data set, feature selection becomes a real prerequisite. This paper proposes a methodology for feature selection for an imbalanced leukaemia gene expression data based on random forest algorithm. It presents the importance of feature selection in terms of reducing the number of features, enhancing the quality of machine learning and providing better understanding for biologists in diagnosis and prediction. Algorithms are presented to show the methodology and strategy for feature selection taking care to avoid over fitting. Moreover, experiments are done using imbalanced Leukaemia gene expression data and special measurement is used to evaluate the quality of feature selection and performance of classification.

[1]  Thomas A. Darden,et al.  Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method , 2001, Bioinform..

[2]  Stan Matwin,et al.  Machine Learning for the Detection of Oil Spills in Satellite Radar Images , 1998, Machine Learning.

[3]  Nir Friedman,et al.  Tissue classification with gene expression profiles. , 2000 .

[4]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[5]  Pedro M. Domingos MetaCost: a general method for making classifiers cost-sensitive , 1999, KDD '99.

[6]  Nitesh V. Chawla,et al.  SMOTEBoost: Improving Prediction of the Minority Class in Boosting , 2003, PKDD.

[7]  Charles Elkan,et al.  The Foundations of Cost-Sensitive Learning , 2001, IJCAI.

[8]  U. Alon,et al.  Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[9]  Isabelle Guyon,et al.  An Introduction to Variable and Feature Selection , 2003, J. Mach. Learn. Res..

[10]  Francisco Azuaje,et al.  An assessment of recently published gene expression data analyses: reporting experimental design and statistical factors , 2006, BMC Medical Informatics Decis. Mak..

[11]  Michael J. Pazzani,et al.  Reducing Misclassification Costs , 1994, ICML.

[12]  Michal Linial,et al.  Novel Unsupervised Feature Filtering of Biological Data , 2006, ISMB.