Efficiently Identifying Max-Gap Clusters in Pairwise Genome Comparison

The spatial clustering of genes across different genomes has been used to study important problems in comparative genomics, from the identification of operons to the detection of homologous regions. A set of formal models and algorithms for so-called max-gap clusters has recently been proposed. These algorithms guarantee the completeness of the results, and the simplicity of the model enables a rigorous statistical test of significance. These features overcome the weaknesses of many previous methods, which are often heuristic in nature. We developed a very efficient algorithm to compute max-gap clusters in pairwise genome comparison. Under a number of different settings, our algorithm is an order of magnitude faster than the previous algorithm based on the same model. In our evaluation on two bacterial genomes, we showed that our method could identify known operons as well as some novel structures in the genome. Through a comparison of the human and mouse genomes, we also demonstrated that the current framework for conserved spatial clustering of genes can be used to detect homologous regions in higher organisms.
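To make the max-gap model concrete, the following is a minimal sketch of a naive baseline, not the efficient algorithm described above. It assumes each shared gene occurs exactly once per genome (no homology families), that the gap is the difference between integer gene positions, and that a cluster is a maximal gene set whose consecutive members lie within `max_gap` of each other in both genomes. The names `split_by_gap`, `max_gap_clusters`, and `min_size` are hypothetical and chosen only for illustration.

```python
from typing import Dict, FrozenSet, List


def split_by_gap(genes: List[str], pos: Dict[str, int], max_gap: int) -> List[List[str]]:
    """Split genes into runs whose consecutive positions differ by at most max_gap."""
    ordered = sorted(genes, key=lambda g: pos[g])
    runs = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if pos[cur] - pos[prev] <= max_gap:
            runs[-1].append(cur)   # still within the allowed gap
        else:
            runs.append([cur])     # gap too large: start a new run
    return runs


def max_gap_clusters(
    pos_a: Dict[str, int],
    pos_b: Dict[str, int],
    max_gap: int,
    min_size: int = 2,
) -> List[FrozenSet[str]]:
    """Naive repeated splitting: refine groups until each one is
    gap-connected in both genomes (assumed one-to-one gene mapping)."""
    shared = set(pos_a) & set(pos_b)
    if not shared:
        return []
    pending = [sorted(shared)]
    clusters = []
    while pending:
        group = pending.pop()
        runs_a = split_by_gap(group, pos_a, max_gap)
        if len(runs_a) > 1:        # broken in genome A: refine further
            pending.extend(runs_a)
            continue
        runs_b = split_by_gap(group, pos_b, max_gap)
        if len(runs_b) > 1:        # broken in genome B: refine further
            pending.extend(runs_b)
            continue
        if len(group) >= min_size:  # stable in both genomes
            clusters.append(frozenset(group))
    return clusters


if __name__ == "__main__":
    # Toy positions, invented purely for illustration.
    genome_a = {"a": 1, "b": 2, "c": 3, "d": 10, "e": 11}
    genome_b = {"a": 5, "b": 1, "c": 6, "d": 2, "e": 3}
    for cluster in max_gap_clusters(genome_a, genome_b, max_gap=2):
        print(sorted(cluster))
```

Each round of splitting can rescan a group in both genomes, so this baseline can take quadratic time in the worst case; it is meant only to show what the model computes, and says nothing about how the faster algorithm achieves its speedup.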
