Optimizing the Research of DNA Sequences in a NoSQL Document Database: A Preliminary Study

The study of DNA sequences has become indispensable for basic biological research and for numerous applied fields, such as comparative genomics, evolutionary biology, pangenomics, genetics of disease, regulation of gene expression, oncology and many others, all supported by bioinformatics. In the era of Cloud computing, federating the Cloud systems of different genetics research organisations paves the way towards new forms of data sharing and new mashup services and applications. However, due to the huge amount of genomics data (genomics Big Data) that has to be managed, a parallel, distributed NoSQL DataBase Management System (DBMS) approach becomes fundamental. Specifically, due to the textual nature of genomics data, a document-oriented NoSQL DBMS appears to be the most suitable solution. In this paper, considering the whole human genome, we present a preliminary study that compares such a NoSQL solution, MongoDB, with a SQL database solution, MySQL, when searching for DNA sequences. Moreover, in order to optimize the search for genomic codes, we adopt hash functions that map nucleotide sequences of arbitrary size onto data of a fixed, smaller size. Experiments show that MongoDB, apart from simplifying the management of genomics data, provides better performance.
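
The idea of mapping nucleotide sequences of arbitrary size onto fixed-size keys can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the method used in the study: the abstract does not name the hash function, so BLAKE2b with an 8-byte digest is an arbitrary choice here, and the helper names (sequence_key, build_index, find) and the in-memory dictionary standing in for a database index are hypothetical.

```python
import hashlib


def sequence_key(sequence, digest_bytes=8):
    """Map a nucleotide string of any length to a fixed-size hex key.

    Assumption: BLAKE2b is used purely for illustration; the study does
    not state which hash function it adopts.
    """
    normalized = sequence.strip().upper()
    digest = hashlib.blake2b(normalized.encode("ascii"), digest_size=digest_bytes)
    return digest.hexdigest()


def build_index(genome, k):
    """Index every length-k window of the genome by its hashed key.

    The dictionary stands in for an indexed field in a database.
    """
    index = {}
    for pos in range(len(genome) - k + 1):
        key = sequence_key(genome[pos:pos + k])
        index.setdefault(key, []).append(pos)
    return index


def find(genome, index, query):
    """Look up a query by its hash, then verify to rule out collisions."""
    normalized = query.strip().upper()
    candidates = index.get(sequence_key(normalized), [])
    return [p for p in candidates if genome[p:p + len(normalized)] == normalized]


if __name__ == "__main__":
    genome = "ACGTACGTTTGACGTA"
    index = build_index(genome, k=4)
    print(find(genome, index, "acgt"))
```

The lookup compares short fixed-size keys instead of long strings, and the final verification step guards against hash collisions. In a document store, the key would typically be saved as an indexed field next to the sequence itself.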
