Resampling-based tests of functional categories in gene expression studies

William T. Barry: Resampling-based tests of functional categories in gene expression studies (Under the direction of Dr. Fred A. Wright and Dr. Andrew B. Nobel)

DNA microarrays allow researchers to measure the expression of thousands of genes simultaneously, and are commonly used to identify changes in expression either across experimental conditions or in association with some clinical outcome. With the increasing availability of gene annotation, researchers have begun to ask global questions of functional genomics that explore the interactions of genes in cellular processes and signaling pathways. A common hypothesis test for gene categories is constructed as a post hoc analysis, performed once a list of significant genes has been identified, using classically derived tests for 2x2 contingency tables. We note several drawbacks to this approach, including the violation of an independence assumption by the correlation in expression that exists among genes. To test gene categories in a more appropriate manner, we propose a flexible, permutation-based framework, termed SAFE (for Significance Analysis of Function and Expression). SAFE is a two-stage approach: gene-specific statistics are first calculated for the association between expression and the response of interest, and a global statistic is then used to detect a shift within a gene category toward more extreme associations. Significance is assessed by repeatedly permuting whole arrays, so that the correlation among genes is preserved and accounted for. This permutation scheme also preserves the relatedness of categories containing overlapping genes, such that error rate estimates can be readily obtained for multiple dependent tests. Through a detailed survey of gene category tests and simulations based on real microarray data, we demonstrate how SAFE generates appropriate Type I error rates as compared to other methods.
Under a more rigorously defined null hypothesis, permutation-based tests of gene categories are shown to be conservative, because permutation induces a special case of the null in which the variance of the test statistic is at its maximum. A bootstrap-based approach to hypothesis testing is incorporated into the SAFE framework, providing better coverage and improved power under a defined class of alternatives. Lastly, we extend the SAFE framework to consider gene categories in a probabilistic manner. This allows for a hypothesis test of co-regulation, using models of transcription factor binding sites to score for the presence of motifs in the upstream regions of genes.
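The two-stage permutation procedure described above can be illustrated with a minimal sketch. It is not the SAFE implementation. The choice of a Welch t-statistic as the gene-specific statistic, a rank sum of absolute t-statistics as the global statistic, the add-one p-value estimate, and all function and parameter names (`local_stats`, `global_stat`, `safe_permutation_pvalue`, `n_perm`) are assumptions made for illustration. It assumes a two-group comparison and an expression matrix with genes in rows and arrays in columns.

```python
import numpy as np


def local_stats(expr, labels):
    """Stage 1 (assumed form): Welch t-statistic for each gene (row)."""
    a = expr[:, labels == 1]
    b = expr[:, labels == 0]
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1]
                 + b.var(axis=1, ddof=1) / b.shape[1])
    return (a.mean(axis=1) - b.mean(axis=1)) / se


def global_stat(t, in_category):
    """Stage 2 (assumed form): rank sum of |t| over genes in the category."""
    ranks = np.abs(t).argsort().argsort() + 1  # ranks 1..G, ties ignored
    return ranks[in_category].sum()


def safe_permutation_pvalue(expr, labels, in_category, n_perm=1000, seed=0):
    """Empirical p-value for one gene category.

    Shuffling the response labels reassigns whole arrays (columns) at once,
    so the gene-gene correlation within each array is left intact.
    """
    rng = np.random.default_rng(seed)
    observed = global_stat(local_stats(expr, labels), in_category)
    exceed = 0
    for _ in range(n_perm):
        permuted = rng.permutation(labels)
        if global_stat(local_stats(expr, permuted), in_category) >= observed:
            exceed += 1
    # Add-one estimate keeps the p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)
```

Because the same permuted labels can be reused for every category, the joint distribution of the global statistics across overlapping categories is retained, which is what makes resampling-based error rate estimates for multiple dependent tests possible in this kind of scheme.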
