Direct integration of microarrays for selecting informative genes and phenotype classification

The ability to provide thousands of gene expression values simultaneously makes microarray data very useful for phenotype classification. A major constraint in phenotype classification is that the number of genes greatly exceeds the number of samples. We overcame this constraint in two ways; we increased the number of samples by integrating independently generated microarrays that had been designed with the same biological objectives, and reduced the number of genes involved in the classification by selecting a small set of informative genes. We were able to maximally use the abundant microarray data that is being stockpiled by thousands of different research groups while improving classification accuracy. Our goal is to implement a feature (gene) selection method that can be applicable to integrated microarrays as well as to build a highly accurate classifier that permits straightforward biological interpretation. In this paper, we propose a two-stage approach. Firstly, we performed a direct integration of individual microarrays by transforming an expression value into a rank value within a sample and identified informative genes by calculating the number of swaps to reach a perfectly split sequence. Secondly, we built a classifier which is a parameter-free ensemble method using only the pre-selected informative genes. By using our classifier that was derived from large, integrated microarray sample datasets, we achieved high accuracy, sensitivity, and specificity in the classification of an independent test dataset.

[1]  Lingsong Zhang,et al.  STATISTICAL METHODS IN BIOLOGY , 1902, Nature.

[2]  Tao Li,et al.  A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression , 2004, Bioinform..

[3]  Louiqa Raschid,et al.  Data Integration in the Life Sciences, Second InternationalWorkshop, DILS 2005, San Diego, CA, USA, July 20-22, 2005, Proceedings , 2005, DILS.

[4]  Peter J. Park,et al.  A Nonparametric Scoring Algorithm for Identifying Informative Genes from Microarray Data , 2000, Pacific Symposium on Biocomputing.

[5]  Yonghong Peng,et al.  A novel ensemble machine learning for robust microarray data classification , 2006, Comput. Biol. Medicine.

[6]  Sangsoo Kim,et al.  Combining multiple microarray studies and modeling interstudy variation , 2003, ISMB.

[7]  Yanqing Zhang,et al.  Support vector machines with genetic fuzzy feature transformation for biomedical data classification , 2007, Inf. Sci..

[8]  E. Lander,et al.  Gene expression correlates of clinical prostate cancer behavior. , 2002, Cancer cell.

[9]  Thomas A. Darden,et al.  Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method , 2001, Bioinform..

[10]  J. Welsh,et al.  Analysis of gene expression identifies candidate markers and pharmacological targets in prostate cancer. , 2001, Cancer research.

[11]  Belur V. Dasarathy,et al.  Nearest neighbor (NN) norms: NN pattern classification techniques , 1991 .

[12]  Usama M. Fayyad,et al.  Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning , 1993, IJCAI.

[13]  Nir Friedman,et al.  Tissue classification with gene expression profiles. , 2000 .

[14]  Jian Pei,et al.  Mining phenotypes and informative genes from gene expression data , 2003, KDD '03.

[15]  Marko Robnik-Sikonja,et al.  Theoretical and Empirical Analysis of ReliefF and RReliefF , 2003, Machine Learning.

[16]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[17]  T. Barrette,et al.  Meta-analysis of microarrays: interstudy validation of gene expression profiles reveals pathway dysregulation in prostate cancer. , 2002, Cancer research.

[18]  Jun Chen,et al.  Joint analysis of two microarray gene-expression data sets to select lung adenocarcinoma marker genes , 2004, BMC Bioinformatics.

[19]  Kari Torkkola,et al.  Self-organizing maps in mining gene expression data , 2001, Inf. Sci..

[20]  Sandrine Dudoit,et al.  Classification in microarray experiments , 2003 .

[21]  R. Suganya,et al.  Data Mining Concepts and Techniques , 2010 .

[22]  B. C. Brookes,et al.  Information Sciences , 2020, Cognitive Skills You Need for the 21st Century.

[23]  Daniel Q. Naiman,et al.  Simple decision rules for classifying human cancers from gene expression profiles , 2005, Bioinform..

[24]  E. Latulippe,et al.  Comprehensive gene expression analysis of prostate cancer reveals distinct transcriptional programs associated with metastatic disease. , 2002, Cancer research.

[25]  Jiong Yang,et al.  Integrating Heterogeneous Microarray Data Sources Using Correlation Signatures , 2005, DILS.

[26]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.

[27]  A. I.,et al.  Neural Field Continuum Limits and the Structure–Function Partitioning of Cognitive–Emotional Brain Networks , 2023, Biology.

[28]  Ian Witten,et al.  Data Mining , 2000 .

[29]  D. Edwards,et al.  Statistical Analysis of Gene Expression Microarray Data , 2003 .

[30]  Heekuck Oh,et al.  Neural Networks for Pattern Recognition , 1993, Adv. Comput..

[31]  James D. Lawrey,et al.  Statistical Methods in Biology , 1996 .