Translating the overwhelming amount of data generated in high-throughput genomics experiments into biologically meaningful evidence, which may for example point to a series of biomarkers or hint at a relevant pathway, is a matter of great interest in bioinformatics. Genes showing similar experimental profiles are hypothesized to share biological mechanisms that, if understood, could provide clues to the molecular processes leading to pathological events. Whether and how a priori information about the known genes can explain coexpression is a topic of further study. One popular method of knowledge discovery in high-throughput genomics experiments, enrichment analysis (EA), seeks to infer whether an interesting collection of genes is 'enriched' for a particular set of a priori Gene Ontology Consortium (GO) classes. For the purposes of statistical testing, the conventional methods offered in EA software implicitly assume independence between the GO classes. However, a gene may be annotated to more than one biological classification, so the enrichment test statistics for different GO classes can be highly dependent when the overlapping gene sets are relatively large. There is therefore a need to determine formally whether conventional EA results are robust to the independence assumption. We derive the exact null distribution for testing enrichment of GO classes by relaxing the independence assumption using well-known statistical theory. In applications with publicly available data sets, our test results are similar to those of the conventional approach, which assumes independence. We argue that the independence assumption is not detrimental.
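To make the conventional per-class test concrete, the following is a minimal sketch that assumes the one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test on a 2x2 table). The text above does not name the conventional test, so this choice is an assumption, and the function name, argument names, and toy counts are hypothetical. The sketch tests a single GO class in isolation, so it ignores genes shared between classes, which is exactly the dependence at issue; it is not the exact null distribution derived in this work.

```python
from math import comb


def enrichment_p_value(n_genes, n_in_class, n_selected, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_genes    : total number of annotated genes (the gene universe)
    n_in_class : genes in the universe annotated to the GO class
    n_selected : size of the interesting gene collection
    n_overlap  : genes in the collection that belong to the GO class

    Assumption: the collection is treated as a random draw without
    replacement from the universe, and each GO class is tested on its own.
    """
    if not 0 <= n_overlap <= min(n_in_class, n_selected):
        raise ValueError("n_overlap is outside the feasible range")
    total = comb(n_genes, n_selected)
    upper = min(n_in_class, n_selected)
    # Sum the hypergeometric probabilities of all outcomes at least as
    # extreme as the observed overlap. math.comb returns 0 when the second
    # argument exceeds the first, so infeasible terms contribute nothing.
    tail = sum(
        comb(n_in_class, i) * comb(n_genes - n_in_class, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(enrichment_p_value(n_genes=10, n_in_class=5, n_selected=5, n_overlap=5))
```

Exact integer arithmetic with `math.comb` avoids floating-point underflow in the individual terms; only the final ratio is converted to a float.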