GenArk: Towards a million UCSC genome browsers

Interactive graphical genome browsers are essential tools for biologists working with DNA sequences. Although tens of thousands of new genome assemblies have become available over the last decade, their accessibility is limited by the work involved in manually creating browsers and curating annotations, and the results can push the limits of existing data storage infrastructure. To facilitate managing this growing number of genome assemblies, we created the Genome Archive (GenArk) collection of UCSC Genome Browsers from assemblies hosted at NCBI (1). Built on our established assembly hub system, this collection enables fast, on-demand visualization of chromosome regions without requiring a database server. Available annotations include gene models (some mapped through whole-genome alignments), repeat masks, GC content, and others. We also modified our popular BLAT (2) aligner and in-silico PCR to support a large number of genomes using limited RAM. Users can upload additional annotations themselves via track hubs (3) and custom tracks. We can also import more annotations in bulk from third-party resources, as demonstrated here with TOGA (4) gene models. Our system overcomes previous technical limits on the number of genomes and annotations. At the time of writing, 2,430 GenArk assemblies are listed at https://hgdownload.soe.ucsc.edu/hubs/ and can be found by searching on the main UCSC gateway page. We will continue to add all high-quality human assemblies; for other organisms, we look forward to receiving requests from the research community for more browsers and whole-genome alignments via http://genome.ucsc.edu/assemblyRequest.html.
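
Because each GenArk browser is an assembly hub served as plain files, a browser can in principle be addressed from the NCBI assembly accession alone. The following is a minimal sketch of that idea, not the pipeline's own code. It assumes that hubs are laid out under the listing URL as `<GCA|GCF>/<3 digits>/<3 digits>/<3 digits>/<accession>/hub.txt` and that the browser accepts `hubUrl` and `genome` parameters on `hgTracks`; the helper names `genark_hub_url` and `genark_browser_url` are hypothetical.

```python
import re
from urllib.parse import urlencode

# Listing URL quoted in the text; the directory layout below is an assumption.
GENARK_BASE = "https://hgdownload.soe.ucsc.edu/hubs"
BROWSER_CGI = "https://genome.ucsc.edu/cgi-bin/hgTracks"


def genark_hub_url(accession: str) -> str:
    """Build the assumed hub.txt URL for an NCBI assembly accession."""
    # NCBI assembly accessions look like GCA_/GCF_ + 9 digits + .version
    match = re.fullmatch(r"(GC[AF])_(\d{9})\.(\d+)", accession)
    if match is None:
        raise ValueError(f"not an NCBI assembly accession: {accession!r}")
    prefix, digits, _version = match.groups()
    # Split the 9-digit identifier into three 3-digit path components.
    groups = [digits[i:i + 3] for i in range(0, 9, 3)]
    return "/".join([GENARK_BASE, prefix, *groups, accession, "hub.txt"])


def genark_browser_url(accession: str) -> str:
    """Build a browser link that attaches the hub (assumed CGI parameters)."""
    query = urlencode({"hubUrl": genark_hub_url(accession), "genome": accession})
    return f"{BROWSER_CGI}?{query}"


if __name__ == "__main__":
    print(genark_hub_url("GCF_000001405.40"))
    print(genark_browser_url("GCF_000001405.40"))
```

The sketch only constructs URLs and performs no network access, so whether a given accession is actually present in the collection still has to be checked against the listing or the gateway page search.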

[1] K. Lindblad-Toh, et al. Integrating gene annotation with orthology inference at scale, 2023, bioRxiv.

[2] James E. Allen, et al. Ensembl 2022, 2021, Nucleic Acids Res.

[3] Eric M Weitz, et al. Accessing NCBI data using the NCBI Sequence Viewer and Genome Data Viewer (GDV), 2020, Genome research.

[4] Jeremy Goecks, et al. G-OnRamp: a Galaxy-based platform for collaborative annotation of eukaryotic genomes, 2019, Bioinform.

[5] Katharina J. Hoff, et al. MakeHub: Fully Automated Generation of UCSC Genome Browser Assembly Hubs, 2019, bioRxiv.

[6] Mario Stanke, et al. Predicting Genes in Single Genomes with AUGUSTUS, 2018, Current protocols in bioinformatics.

[7] Juan Carlos Castilla-Rubio, et al. Earth BioGenome Project: Sequencing life for the future of life, 2018, Proceedings of the National Academy of Sciences.

[8] D. Karolchik, et al. The UCSC Genome Browser database: 2017 update, 2016, Nucleic Acids Res.

[9] K. Pruitt, et al. P8008 The NCBI Eukaryotic Genome Annotation Pipeline, 2016.

[10] Deanna M. Church, et al. Assembly: a resource for assembled genomes at NCBI, 2015, Nucleic Acids Res.

[11] Wen J. Li, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation, 2015, Nucleic Acids Res.

[12] Brian T. Lee, et al. The UCSC Genome Browser database: 2015 update, 2014, Nucleic Acids Res.

[13] Ting Wang, et al. Track data hubs enable visualization of user-defined genome-wide annotations on the UCSC Genome Browser, 2013, Bioinform.

[14] Alejandro A. Schäffer, et al. WindowMasker: window-based masker for sequenced genomes, 2006, Bioinform.

[15] Terrence S. Furey, et al. The UCSC Genome Browser Database: update 2006, 2005, Nucleic Acids Res.

[16] W. J. Kent, et al. BLAT--the BLAST-like alignment tool, 2002, Genome research.

[17] G. Benson, et al. Tandem repeats finder: a program to analyze DNA sequences, 1999, Nucleic acids research.