Finding Biomarker Signatures in Pooled Sample Designs: A Simulation Framework for Methodological Comparisons

Discriminating patterns in gene expression data can be detected with a variety of statistical learning methods. Sample pooling has been suggested to impair this task; however, pooling cannot always be avoided. We propose a simulation framework that explicitly varies pattern parameters, experimental design, noise, and choice of method in order to determine what effects on classification performance are to be expected. We use a two-group classification task and simulated gene expression data containing independent differentially expressed genes, bivariate linear patterns, or a combination of both. Our results show a clear increase in prediction error with pool size. For pooled training sets, powered partial least squares discriminant analysis outperforms discriminant analysis, random forests, and support vector machines with linear or radial kernels in two of the three simulated scenarios. The proposed simulation approach can be used to systematically investigate additional scenarios of practical interest.
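
As a rough illustration of the kind of simulation described above, the following minimal sketch generates two-group data with independent differentially expressed genes and forms pools by averaging individuals within each group. Everything in it is an assumption made for illustration and is not taken from the paper: the function names (`simulate_groups`, `pool_samples`, `nearest_centroid_error`), the unit-variance Gaussian noise model, pooling as an arithmetic mean on the simulated scale, and the nearest-centroid classifier, which is a simple stand-in and not one of the methods compared in the study.

```python
import numpy as np


def simulate_groups(n_per_group, n_genes, n_de, effect, rng):
    """Simulate two groups of samples (hypothetical noise model).

    All genes are standard normal; the first `n_de` genes are shifted
    by `effect` in group 1 (independent differentially expressed genes).
    """
    X = rng.normal(0.0, 1.0, size=(2 * n_per_group, n_genes))
    y = np.repeat([0, 1], n_per_group)
    X[y == 1, :n_de] += effect
    return X, y


def pool_samples(X, y, pool_size):
    """Average consecutive samples of the same group into pools.

    Assumption: a pool is the arithmetic mean of its members.
    Leftover samples that do not fill a whole pool are dropped.
    """
    pooled_X, pooled_y = [], []
    for label in np.unique(y):
        group = X[y == label]
        n_pools = group.shape[0] // pool_size
        trimmed = group[: n_pools * pool_size]
        pooled_X.append(trimmed.reshape(n_pools, pool_size, -1).mean(axis=1))
        pooled_y.append(np.full(n_pools, label))
    return np.vstack(pooled_X), np.concatenate(pooled_y)


def nearest_centroid_error(X_train, y_train, X_test, y_test):
    """Misclassification rate of a nearest-centroid rule (stand-in classifier)."""
    centroids = np.stack([X_train[y_train == k].mean(axis=0) for k in (0, 1)])
    dist = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    pred = dist.argmin(axis=1)
    return float((pred != y_test).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_train, y_train = simulate_groups(24, 200, 10, 1.0, rng)
    X_test, y_test = simulate_groups(100, 200, 10, 1.0, rng)  # unpooled test set
    for pool_size in (1, 2, 4):
        Xp, yp = pool_samples(X_train, y_train, pool_size)
        err = nearest_centroid_error(Xp, yp, X_test, y_test)
        print(pool_size, Xp.shape[0], err)
```

This sketch only shows the mechanics of simulating and pooling; it is not expected to reproduce the reported findings. In particular, a centroid-based rule is largely insensitive to averaging within groups, so comparing pool sizes meaningfully would require substituting the classifiers and noise components actually under study.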
