KibioR & Kibio: a new architecture for next-generation data querying and sharing in big biology

MOTIVATION The growing production of massive heterogeneous biological data offers opportunities for new discoveries. However, multi-omics data analysis is challenging, and researchers must handle the ever-increasing complexity of both data management and the evolution of our biological understanding. Substantial efforts have been made to unify biological datasets into integrated systems. Unfortunately, these systems are not easily scalable, deployable or searchable, whether locally or globally.

RESULTS This publication presents two tools with a simple structure that can help any data provider, organization or researcher who requires a reliable base for data search and analysis. The first tool is Kibio, a scalable and adaptable data store based on the Elasticsearch search engine. The second tool is KibioR, an R package to pull, push and search Kibio datasets or any accessible Elasticsearch-based database. These tools apply a uniform data exchange model and minimize the burden of data management by organizing data into a decentralized, versatile, searchable and shareable structure. Several case studies are presented using multiple databases, from drug characterization to miRNA and pathway identification, emphasizing the ease of use and versatility of the Kibio/KibioR framework.

AVAILABILITY Both KibioR and Elasticsearch are open source. The KibioR package source is available at https://github.com/regisoc/kibior and the library is available on CRAN at https://cran.r-project.org/package=kibior.

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
