A semiparametric approach for marker gene selection based on gene expression data

MOTIVATION: Identifying differentially expressed genes is a major issue in gene expression data analysis, and selecting marker genes is critical for tumor classification based on gene expression data. In this paper, we propose a semiparametric two-sample test both to identify differentially expressed genes and to select marker genes for sample classification. RESULTS: A simulation study shows that, when the sample size is small, the proposed method is more robust and more powerful than commonly used methods such as t-tests and non-parametric rank-sum tests. Cross-validation shows that sample classification based on genes selected with this semiparametric method yields lower misclassification rates. CONTACT: hongyu.zhao@yale.edu
