A Graph Theoretical Approach to Data Fusion

The rapid development of high-throughput experimental techniques has produced a growing diversity of genomic datasets that require analysis. A variety of computational techniques allow us to analyse such data and to model the biological processes behind them. However, it is increasingly recognised that we can gain deeper understanding by combining the insights obtained from multiple, diverse datasets. We therefore require scalable computational approaches for data fusion. We propose a novel methodology for scalable unsupervised data fusion. Our technique exploits network representations of the data in order to identify (and quantify) similarities among the datasets. We may work within the Bayesian formalism, using Bayesian nonparametric approaches to model each dataset; or, for fast, approximate, and massive-scale data fusion, we can naturally switch to more heuristic modelling techniques. An advantage of the proposed approach is that each dataset can initially be modelled independently (and therefore in parallel), before a fast post-processing step is applied to perform data fusion. This allows us to incorporate new experimental data in an online fashion, without having to rerun the entire analysis. The methodology can be applied to genomic-scale datasets, and we demonstrate its applicability on examples from the literature, using a broad range of genomic datasets, and also on a recent gene expression dataset from sporadic inclusion body myositis.

Availability: Example R code and instructions are available from https://sites.google.com/site/gtadatafusion/.
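To make the workflow described above concrete, the following is a minimal Python sketch under stated assumptions; it is not the authors' implementation (their example code is in R at the URL above). It assumes each dataset has already been clustered independently, yielding several label vectors per dataset (for example, posterior samples or resampling runs). Each dataset is then summarised as a weighted network whose edge weights are co-clustering frequencies. The function names, the use of Pearson correlation to quantify similarity between networks, and the elementwise product as the fusion step are all illustrative choices rather than details taken from the abstract.

```python
import numpy as np

def coclustering_matrix(label_samples):
    """Weighted adjacency matrix: the fraction of label vectors in which
    each pair of items is assigned to the same cluster."""
    samples = np.asarray(label_samples)          # shape (n_samples, n_items)
    same = samples[:, :, None] == samples[:, None, :]
    return same.mean(axis=0)                     # shape (n_items, n_items)

def network_similarity(a, b):
    """Pearson correlation between the off-diagonal edge weights of two
    networks defined on the same items (an assumed similarity measure)."""
    upper = np.triu_indices_from(a, k=1)
    return float(np.corrcoef(a[upper], b[upper])[0, 1])

def fuse(networks):
    """Assumed fusion rule: elementwise product, so an edge stays strong
    only when every dataset supports it."""
    return np.prod(np.stack(networks), axis=0)

# Hypothetical toy input: three clusterings of four genes per dataset.
expression_labels = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1]]
binding_labels = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]]

# Each dataset is modelled on its own, so these calls could run in parallel.
expression_net = coclustering_matrix(expression_labels)
binding_net = coclustering_matrix(binding_labels)

# Fast post-processing step: compare the networks, then combine them.
similarity = network_similarity(expression_net, binding_net)
fused_net = fuse([expression_net, binding_net])
```

Because each network is computed separately, adding a new dataset in this sketch only requires building its own co-clustering matrix and repeating the inexpensive `fuse` call, which mirrors the online property claimed above.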
