Detecting Repeat Families in Incompletely Sequenced Genomes

Repeats form a major class of genomic sequence, with implications for functional genomics and for practical problems. Their detection and analysis pose a number of challenges in genomic sequence analysis, especially if the genome is not completely sequenced. The most abundant and evolutionarily active repeats occur as families of long, similar sequences. We present a novel method for repeat family detection and characterization in cases where the target genome sequence is not completely known. To this end, we first establish the sequence graph, a compacted version of sparse de Bruijn graphs. By analyzing the structure of this graph and of its connected components after local modifications, we devise two algorithms for repeat family detection. The applicability of the methods is shown for both simulated and real genomic data sets.
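To make the idea of a compacted de Bruijn graph concrete, the following is a minimal sketch of generic path compaction, not the authors' sequence graph construction. The function names `build_de_bruijn` and `compact`, the choice of (k-1)-mers as nodes, and the toy read are assumptions made for illustration only.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph whose nodes are (k-1)-mers.

    Each k-mer found in a read contributes one edge from its
    (k-1)-prefix to its (k-1)-suffix. (Illustrative assumption:
    reads are plain strings, single strand, no sequencing errors.)
    """
    succ = defaultdict(set)
    pred = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            u, v = kmer[:-1], kmer[1:]
            succ[u].add(v)
            pred[v].add(u)
    return succ, pred


def compact(succ, pred):
    """Merge each maximal unbranched path into one labeled edge.

    Returns a sorted list of (start_node, end_node, label) tuples.
    Isolated cycles that contain no branching node are not reported
    by this simplified sketch.
    """
    nodes = set(succ) | set(pred)

    def is_inner(n):
        # An inner node has exactly one predecessor and one successor.
        return len(succ.get(n, ())) == 1 and len(pred.get(n, ())) == 1

    edges = []
    for u in nodes:
        if is_inner(u):
            continue  # paths start only at branching or terminal nodes
        for v in succ.get(u, ()):
            label = u + v[-1]
            while is_inner(v):
                v = next(iter(succ[v]))
                label += v[-1]
            edges.append((u, v, label))
    return sorted(edges)


if __name__ == "__main__":
    # Toy read in which the 3-mer "TGC" occurs twice.
    succ, pred = build_de_bruijn(["ATGCGTGCA"], k=4)
    for edge in compact(succ, pred):
        print(edge)
```

In the toy read, the repeated 3-mer `TGC` becomes a node with two predecessors and two successors, so it survives compaction as a branching node while the unique sequence around it collapses into long edges. Branching structure of this kind is what a graph-based repeat analysis can inspect.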
