Shared data science infrastructure for genomics data

Background: Building a scalable computational infrastructure to analyze the wealth of information in data repositories is difficult because of significant barriers to organizing, extracting, and analyzing the relevant data. Shared Data Science Infrastructures such as Boa can process and parse the data in large repositories more efficiently. Boa's main features are inspired by existing languages for data-intensive computing, and Boa can easily integrate data from biological data repositories.

Results: Here, we present an implementation of Boa for Genomic research (BoaG) on a relatively small data repository: RefSeq’s 97,716 annotation (GFF) and assembly (FASTA) files and metadata. We used BoaG to query the entire RefSeq dataset, gaining insight into the RefSeq genome assemblies and gene model annotations, and show that assembly quality obtained with the same assembler varies by species.

Conclusions: Innovative methods are required to keep pace with our ability to produce biological data. The Shared Data Science Infrastructure BoaG can give researchers greater access to explore data efficiently in ways previously possible only for the most well-funded research groups. As a proof of concept for much larger datasets, we demonstrate the efficiency of BoaG by exploring the RefSeq database of genome assemblies and annotations to identify interesting features of gene annotation.
