Exploiting Genomic Relations in Big Data Repositories by Graph-Based Search Methods

We live at a time in which mass data can be generated in almost any field of science. In pharmacogenomics, for instance, there are a number of big data repositories, e.g., the Library of Integrated Network-based Cellular Signatures (LINCS), that provide millions of measurements at the genomic level. However, to translate these data into meaningful information, they need to be analyzable. The first step of such an analysis is the deliberate selection of subsets of raw data for studying dedicated research questions. Unfortunately, this is a non-trivial problem when millions of individual data files are available and are linked by an intricate connection structure induced by experimental dependencies. In this paper, we argue that search capabilities supporting this selection need to be introduced for big genomics data repositories, with a specific discussion of LINCS. Specifically, we suggest introducing smart interfaces that use graph-based searches to exploit the connections among individual raw data files, which give rise to a network structure.
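To make the idea of a graph-based search over connected raw data files concrete, the following is a minimal sketch and not the interface proposed in the paper. It assumes, purely for illustration, that each file carries a small metadata record and that two files are connected when they share an experimental attribute. The file ids, the attribute names (`compound`, `cell_line`, `plate`), the values, and the functions `build_graph` and `related_files` are all hypothetical and do not reflect the actual LINCS metadata schema.

```python
from collections import defaultdict, deque

# Hypothetical metadata records (assumption): file id -> experimental attributes.
FILES = {
    "f1": {"compound": "drugA", "cell_line": "MCF7", "plate": "P1"},
    "f2": {"compound": "drugA", "cell_line": "A549", "plate": "P2"},
    "f3": {"compound": "drugB", "cell_line": "A549", "plate": "P2"},
    "f4": {"compound": "drugC", "cell_line": "HT29", "plate": "P3"},
}


def build_graph(files, keys):
    """Connect two files whenever they share a value for any attribute in keys."""
    by_value = defaultdict(set)
    for fid, meta in files.items():
        for key in keys:
            if key in meta:
                by_value[(key, meta[key])].add(fid)
    graph = {fid: set() for fid in files}
    for group in by_value.values():
        for fid in group:
            graph[fid] |= group - {fid}
    return graph


def related_files(graph, start, max_hops):
    """Breadth-first search: map each file within max_hops of start to its distance."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    del dist[start]  # report only the other files
    return dist


graph = build_graph(FILES, keys=("compound", "cell_line"))
print(related_files(graph, "f1", max_hops=2))
```

In this toy setting, a query starting from `f1` reaches `f2` directly (same compound) and `f3` in two hops (via the cell line shared with `f2`), while `f4` stays unreachable. A real repository would likely need an indexed graph store rather than in-memory sets, but the selection logic would follow the same pattern.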
