Statistics for approximate gene clusters

Background: Genes occurring co-localized in multiple genomes can be strong indicators of either functional constraints on genome organization or remnant ancestral gene order. The computational detection of these patterns, usually referred to as gene clusters, has become increasingly sensitive over the past decade. The most powerful approaches allow for various types of imperfect cluster conservation: cluster locations may be internally rearranged; individual cluster locations may contain only a subset of the cluster genes and may be disrupted by uninvolved genes; and cluster locations may not occur at all in some or even most of the studied genomes. Detecting such low-quality clusters increases the risk of mistaking faint patterns that occur merely by chance for genuine findings. It is therefore crucial to estimate the significance of computational gene cluster predictions and to discriminate between true conservation and coincidental clustering.

Results: In this paper, we present an efficient and accurate approach to estimate the significance of gene cluster predictions under the approximate common intervals model. Given a single gene cluster prediction, we calculate the probability of observing it with the same or a higher degree of conservation under the null hypothesis of random gene order, and add a correction factor to account for multiple testing. Our approach considers all parameters that define the quality of gene cluster conservation: the number of genomes in which the cluster occurs, the number of involved genes, the degree of conservation in the different genomes, and the frequency of the clustered genes within each genome. We apply our approach to evaluate gene cluster predictions in a large set of well-annotated genomes.
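To make the shape of such a calculation concrete, the following is a minimal sketch and not the method of the paper. It assumes that a per-genome probability of seeing a chance occurrence under random gene order is already available (the abstract does not say how it is derived), that genomes are independent, and that a Šidák-style correction is used for multiple testing. The function names `prob_at_least` and `sidak_correction` and all numbers are hypothetical.

```python
def prob_at_least(k, probs):
    """Probability that at least k genomes contain a chance occurrence.

    probs[i] is an ASSUMED per-genome probability of a chance occurrence
    under the null hypothesis; genomes are treated as independent.
    """
    # dist[j] = probability that exactly j of the genomes seen so far
    # contain an occurrence (Poisson-binomial distribution, built up
    # one genome at a time).
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for j, q in enumerate(dist):
            nxt[j] += q * (1.0 - p)   # no occurrence in this genome
            nxt[j + 1] += q * p       # occurrence in this genome
        dist = nxt
    return sum(dist[k:])


def sidak_correction(p_value, num_tests):
    """Sidak-style adjustment for num_tests independent tests (an assumption)."""
    return 1.0 - (1.0 - p_value) ** num_tests


# Hypothetical example: occurrences in at least 3 of 4 genomes,
# with 1000 candidate clusters tested.
raw = prob_at_least(3, [0.01, 0.02, 0.01, 0.05])
adjusted = sidak_correction(raw, 1000)
```

Allowing a different probability per genome reflects the abstract's point that gene frequencies and the degree of conservation vary between genomes. A plain binomial tail would force a single shared probability.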
