HomologMiner: looking for homologous genomic groups in whole genomes

MOTIVATION Complex genomes contain numerous repeated sequences, and genomic duplication is believed to be a major evolutionary mechanism for acquiring new functions. Several tools are available for de novo identification of repeat sequences, and many approaches exist for clustering homologous protein sequences. We present an efficient new approach that identifies and clusters homologous DNA sequences with high accuracy at the level of whole genomes, excluding low-complexity repeats, tandem repeats and annotated interspersed repeats. We also determine the boundaries of each group member so that it closely represents a biological unit, e.g. a complete gene, or a partial gene encoding a protein domain.

RESULTS We developed a program called HomologMiner that identifies homologous groups in genome sequences that have been properly marked for low-complexity repeats and annotated interspersed repeats. We applied it to the whole genomes of human (hg17), macaque (rheMac2) and mouse (mm8). The groups obtained include gene families (e.g. the olfactory receptor gene family and zinc finger families), unannotated interspersed repeats, and additional homologous groups that resulted from recent segmental duplications. Our program incorporates several new methods: a new abstract definition of consistent duplicate units, a new criterion for removing moderately frequent tandem repeats, and new algorithmic techniques. We also provide a preliminary analysis of the output on the three genomes mentioned above, and show several applications, including identifying the boundaries of tandem gene clusters and novel interspersed repeat families.

AVAILABILITY All programs and datasets are downloadable from www.bx.psu.edu/miller_lab.
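The abstract does not spell out how group members are assembled, so the following is only an illustrative sketch of the general idea of turning pairwise self-alignment hits into homologous groups. It is not HomologMiner's algorithm: it uses plain single-linkage grouping with a union-find structure, and the function name `cluster_homologs`, the `(chrom, start, end)` region tuples and the `min_overlap` threshold are all assumptions introduced for illustration.

```python
from collections import defaultdict


def cluster_homologs(hits, min_overlap=0.5):
    """Group genomic regions into candidate homologous groups.

    hits: list of (region_a, region_b) pairs, where each region is a
          hypothetical (chrom, start, end) tuple with a half-open interval.
    min_overlap: assumed threshold; two regions on the same chromosome are
          treated as the same unit if they overlap by at least this fraction
          of the shorter region.
    """
    regions = sorted({r for hit in hits for r in hit})
    parent = {r: r for r in regions}

    def find(x):
        # Find the representative of x, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # 1. Regions joined by a pairwise alignment belong to the same group.
    for a, b in hits:
        union(a, b)

    # 2. Regions on the same chromosome that overlap substantially are
    #    treated as the same unit and merged as well.
    by_chrom = defaultdict(list)
    for r in regions:
        by_chrom[r[0]].append(r)  # already sorted by start within a chromosome
    for rs in by_chrom.values():
        for i, a in enumerate(rs):
            for b in rs[i + 1:]:
                if b[1] >= a[2]:
                    break  # later regions start even further right
                overlap = min(a[2], b[2]) - b[1]
                shorter = min(a[2] - a[1], b[2] - b[1])
                if overlap >= min_overlap * shorter:
                    union(a, b)

    groups = defaultdict(list)
    for r in regions:
        groups[find(r)].append(r)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    example_hits = [
        (("chr1", 100, 200), ("chr2", 500, 600)),
        (("chr1", 110, 210), ("chr3", 50, 150)),
        (("chr4", 0, 100), ("chr5", 0, 100)),
    ]
    for group in cluster_homologs(example_hits):
        print(group)
```

Single linkage is the simplest choice and tends to over-merge when one spurious alignment bridges two unrelated families. A real pipeline would need stricter consistency criteria for group members and their boundaries, which is the problem the abstract says HomologMiner addresses.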
