Effects of Sample Size on Differential Gene Expression, Rank Order and Prediction Accuracy of a Gene Signature

Lists of top differentially expressed genes are often inconsistent between studies, and it has been suggested that small sample sizes contribute to the lack of reproducibility and to poor prediction accuracy in discriminative models. We considered sex differences (69♂, 65♀) in 134 human skeletal muscle biopsies using DNA microarrays. The full dataset and subsamples thereof (n = 10 (5♂, 5♀) to n = 120 (60♂, 60♀)) were used to assess the effect of sample size on the differential expression of single genes, gene rank order and prediction accuracy. Using our full dataset (n = 134), we identified 717 differentially expressed transcripts (p<0.0001) and were able to predict sex with ∼90% accuracy, both within our dataset and on external datasets. Both the p-values and the rank order of top differentially expressed genes became more variable with smaller subsamples. For example, at n = 10 (5♂, 5♀), no gene was considered differentially expressed at p<0.0001 and prediction accuracy was ∼50% (no better than chance). We found that sample size clearly affects microarray analysis results; small sample sizes result in unstable gene lists and poor prediction accuracy. We anticipate that this will apply to other phenotypes in addition to sex.
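The subsampling idea can be illustrated with a minimal sketch. It is not the authors' pipeline: the data are synthetic, the gene ranking uses the absolute Welch t-statistic rather than the study's p-value threshold, and all function names and parameters (`simulate`, `top_genes`, `overlap_by_size`, the effect size, the number of genes) are hypothetical choices made for illustration. The sketch draws balanced subsamples of increasing size and reports how much of the full-data top-k gene list each subsample recovers.

```python
import random
import statistics


def welch_t(a, b):
    """Welch's t-statistic for two independent samples."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se


def top_genes(males, females, k):
    """Return the set of indices of the k genes with the largest |t|."""
    n_genes = len(males[0])
    scores = []
    for g in range(n_genes):
        t = welch_t([s[g] for s in males], [s[g] for s in females])
        scores.append((abs(t), g))
    scores.sort(reverse=True)
    return {g for _, g in scores[:k]}


def simulate(n_male=69, n_female=65, n_genes=200, n_true=20, effect=1.0, seed=0):
    """Synthetic expression data (assumption): the first n_true genes are
    shifted by `effect` standard deviations in males; the rest are pure noise."""
    rng = random.Random(seed)

    def sample(shift):
        return [rng.gauss(shift if g < n_true else 0.0, 1.0) for g in range(n_genes)]

    males = [sample(effect) for _ in range(n_male)]
    females = [sample(0.0) for _ in range(n_female)]
    return males, females


def overlap_by_size(males, females, per_group_sizes, k=20, repeats=10, seed=1):
    """Mean fraction of the full-data top-k list recovered by balanced
    subsamples, keyed by total subsample size (2 * per-group size)."""
    rng = random.Random(seed)
    reference = top_genes(males, females, k)
    result = {}
    for m in per_group_sizes:
        overlaps = []
        for _ in range(repeats):
            sub_m = rng.sample(males, m)    # m males, drawn without replacement
            sub_f = rng.sample(females, m)  # m females, drawn without replacement
            overlaps.append(len(top_genes(sub_m, sub_f, k) & reference) / k)
        result[2 * m] = statistics.mean(overlaps)
    return result


if __name__ == "__main__":
    males, females = simulate()
    for n, frac in overlap_by_size(males, females, [5, 20, 60]).items():
        print(f"n = {n:3d}: mean top-20 overlap with full data = {frac:.2f}")
```

On simulated data of this kind, the overlap with the full-data list would be expected to rise with subsample size, which mirrors the instability of small-sample gene lists described above; the exact values depend entirely on the assumed effect size and noise model.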
