A graph theoretical approach to data fusion

Abstract The rapid development of high-throughput experimental techniques has produced a growing diversity of genomic datasets that require analysis. It is increasingly recognized that combining the insights obtained from multiple, diverse datasets can yield a deeper understanding of the underlying biology. We propose a novel, scalable computational approach to unsupervised data fusion. Our technique exploits network representations of the data to identify similarities among the datasets. We may work within the Bayesian formalism, using Bayesian nonparametric approaches to model each dataset, or, for fast, approximate, massive-scale data fusion, switch naturally to more heuristic modeling techniques. An advantage of the proposed approach is that each dataset can initially be modeled independently (and in parallel) before a fast post-processing step performs the data integration. This allows new experimental data to be incorporated in an online fashion, without rerunning the entire analysis. We first demonstrate the applicability of our tool on artificial data, and then on examples from the literature, which include yeast cell cycle, breast cancer, and sporadic inclusion body myositis datasets.
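The two-stage workflow described in the abstract (model each dataset on its own, then fuse through a network representation) can be illustrated with a minimal sketch. The abstract does not specify the fusion rule, so everything below is an assumption made for illustration: each dataset is summarized by a hard clustering, the clustering is turned into a graph whose edges link items that share a cluster, fusion keeps only the edges present in every graph, and the fused groups are the connected components of the result. The function names `coclustering_graph`, `fuse`, and `connected_components` are hypothetical and are not taken from the authors' software.

```python
from itertools import combinations


def coclustering_graph(labels):
    """Edge set linking every pair of items (i, j), i < j, that share a cluster label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def fuse(graphs):
    """Assumed fusion rule: keep only the edges present in every dataset's graph."""
    return set.intersection(*graphs)


def connected_components(n_items, edges):
    """Group items into connected components with a small union-find."""
    parent = list(range(n_items))

    def find(x):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)

    groups = {}
    for item in range(n_items):
        groups.setdefault(find(item), []).append(item)
    return sorted(groups.values())


# Two datasets measured on the same six items, each clustered independently.
labels_a = [0, 0, 0, 1, 1, 1]
labels_b = [0, 0, 1, 1, 2, 2]

graphs = [coclustering_graph(labels_a), coclustering_graph(labels_b)]
fused = connected_components(6, fuse(graphs))
print(fused)  # [[0, 1], [2], [3], [4, 5]]
```

Under this assumed rule, adding a dataset later only requires building its graph and intersecting it with the already fused edge set, which is one way to picture the online updating mentioned above. In a Bayesian setting, the hard labels could be replaced by a thresholded posterior similarity matrix; that substitution is likewise an assumption rather than something the abstract states.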
