Deep repeat resolution—the assembly of the Drosophila Histone Complex

Abstract

Although long-read sequencing technologies have greatly improved the contiguity of de novo genome assemblies, current reference genomes of higher organisms still do not provide unbroken sequences of complete chromosomes. Despite read lengths in excess of 30,000 base pairs, some repetitive structures cannot be resolved by current state-of-the-art assemblers. The most challenging of these structures are tandemly arrayed repeats, which occur in the genomes of all eukaryotes. Untangling tandem repeat clusters is exceptionally difficult, since the rare differences between repeat copies are obscured by the high error rate of long reads. Solving this problem would constitute a major step towards computing fully assembled genomes. Here, using the Drosophila Histone Complex as an example, we demonstrate that machine learning algorithms can exploit the distinguishing patterns of single nucleotide variants between repeat copies, even in very noisy data, to resolve a large and highly conserved repeat cluster. The ideas explored in this paper are a first step towards the automated assembly of complex repeat structures and promise to be applicable to a wide range of eukaryotic genomes.
