Multiple Gene Sets for Cancer Classification Using Gene Range Selection Based on Random Forest

Advances in microarray technology make it possible to obtain genetic information from cancer patients as computational data, and cancer classification using computational software has therefore become possible. Through gene selection, we can identify a number of informative genes in the initial data and group them into smaller subsets for the purpose of classification. In most available methods, the number of genes in a selected subset depends on the gene selection technique used and cannot be fine-tuned to meet a requirement for a particular number of genes. Hence, we propose a technique, gene range selection based on the random forest method, that allows the subset to be selected at a chosen size for better classification of cancer datasets. Our results indicate that using multiple gene sets helps increase the overall classification accuracy on cancer-related datasets, because the number of genes can be adjusted further to create the best subset of genes. Moreover, the technique can assist gene filtering for further analysis of microarray data in gene network analysis, gene-gene interaction analysis, and other related fields.
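The idea of choosing a gene subset by rank range can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each gene already has an importance score (in practice this might come from a trained random forest), and the function names `rank_genes`, `select_gene_range`, and `build_gene_sets`, the gene labels, and the scores are all hypothetical.

```python
from typing import Dict, List


def rank_genes(importances: Dict[str, float]) -> List[str]:
    """Return gene names sorted from most to least important (ties broken by name)."""
    return sorted(importances, key=lambda gene: (-importances[gene], gene))


def select_gene_range(importances: Dict[str, float], start: int, stop: int) -> List[str]:
    """Return the genes whose 1-based importance rank lies in [start, stop]."""
    if start < 1 or stop < start:
        raise ValueError("expected 1 <= start <= stop")
    ranked = rank_genes(importances)
    return ranked[start - 1:stop]


def build_gene_sets(importances: Dict[str, float], sizes: List[int]) -> Dict[int, List[str]]:
    """Build one top-ranked gene set per requested size."""
    return {size: select_gene_range(importances, 1, size) for size in sizes}


# Hypothetical importance scores; assumed here, not taken from the paper.
example_importances = {
    "GENE_A": 0.31,
    "GENE_B": 0.05,
    "GENE_C": 0.22,
    "GENE_D": 0.12,
    "GENE_E": 0.30,
}

# Several candidate gene sets of different sizes, built from the same ranking.
gene_sets = build_gene_sets(example_importances, sizes=[2, 3])
print(gene_sets)
```

Each candidate set could then be passed to a classifier and compared on held-out accuracy, so the subset size becomes a tunable choice rather than a by-product of the selection method.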
