Effort required to finish shotgun-generated genome sequences differs significantly among vertebrates

Background: The approaches for shotgun-based sequencing of vertebrate genomes are now well-established, and have resulted in the generation of numerous draft whole-genome sequence assemblies. In contrast, the process of refining those assemblies to improve contiguity and increase accuracy (known as 'sequence finishing') remains tedious, labor-intensive, and expensive. As a result, the vast majority of vertebrate genome sequences generated to date remain at a draft stage.

Results: To date, our genome sequencing efforts have focused on comparative studies of targeted genomic regions, requiring sequence finishing of large blocks of orthologous sequence (average size 0.5-2 Mb) from various subsets of 75 vertebrates. This experience has provided a unique opportunity to compare the relative effort required to finish shotgun-generated genome sequence assemblies from different species, which we report here. Importantly, we found that the sequence assemblies generated for the same orthologous regions from various vertebrates show substantial variation with respect to misassemblies and, in particular, the frequency and characteristics of sequence gaps. As a consequence, the work required to finish different species' sequences varied greatly. Application of the same standardized methods for finishing provided a novel opportunity to "assay" characteristics of genome sequences among many vertebrate species. It is important to note that many of the problems we have encountered during sequence finishing reflect unique architectural features of a particular vertebrate's genome, which in some cases may have important functional and/or evolutionary implications. Finally, based on our analyses, we have been able to improve our procedures to overcome some of these problems and to increase the overall efficiency of the sequence-finishing process, although significant challenges still remain.

Conclusion: Our findings have important implications for the eventual finishing of the draft whole-genome sequences that have now been generated for a large number of vertebrates.
