De novo repeat classification and fragment assembly

Repetitive sequences make up a significant fraction of almost any genome, and how to represent all repeats in a DNA sequence remains an important open question in bioinformatics. We propose a radically new approach to repeat classification that is motivated by the fundamental topological notion of quotient spaces. The torus and the Klein bottle are examples of quotient spaces that can be obtained from a square by gluing together some of its points. Our new repeat classification algorithm is based on the observation that the alignment-induced quotient space of a DNA sequence compactly represents all sequence repeats. This observation leads to a simple and efficient solution to the repeat classification problem, as well as to new approaches to fragment assembly and multiple alignment.
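
The gluing idea can be illustrated with a minimal sketch. This is not the algorithm of the paper: it assumes exact matches between repeated k-mers as a stand-in for real pairwise alignments, and the function names (`find`, `glue_positions`, `quotient_edges`) and the parameter `k` are hypothetical, chosen only for this illustration. Positions of the sequence that are aligned to each other are merged into one class with a union-find structure, and the quotient is read off as a graph whose edges join the classes of consecutive positions.

```python
from collections import defaultdict


def find(parent, x):
    """Return the representative of x, halving paths along the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def glue_positions(seq, k):
    """Glue sequence positions that are matched by a repeated exact k-mer.

    Assumption for this sketch: every later occurrence of a k-mer is
    "aligned" to its first occurrence, column by column. Real alignments
    would also allow mismatches and gaps.
    Returns a list mapping each position to its class representative
    (the smallest position in the class).
    """
    n = len(seq)
    parent = list(range(n))
    first_seen = {}
    for p in range(n - k + 1):
        kmer = seq[p:p + k]
        q = first_seen.setdefault(kmer, p)
        if q != p:
            for t in range(k):
                a, b = find(parent, p + t), find(parent, q + t)
                if a != b:
                    # Keep the smaller index as the root of the merged class.
                    parent[max(a, b)] = min(a, b)
    return [find(parent, i) for i in range(n)]


def quotient_edges(classes):
    """Count how often the sequence walks from one class to the next."""
    edges = defaultdict(int)
    for a, b in zip(classes, classes[1:]):
        edges[(a, b)] += 1
    return dict(edges)


classes = glue_positions("ACGTTTACGTAA", 4)
edges = quotient_edges(classes)
repeat_edges = {e: m for e, m in edges.items() if m > 1}
print(classes)
print(repeat_edges)
```

In this toy sequence the two copies of `ACGT` collapse onto the same classes, so the edges inside the repeat are traversed twice while the unique flanking edges are traversed once. Edge multiplicity in the glued graph is what marks the repeat.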
