Modelling-based experiment retrieval: A case study with gene expression clustering

MOTIVATION: Public and private repositories of experimental data are growing to sizes that require dedicated methods for finding relevant data. To improve on the state of the art of keyword searches over annotations, methods for content-based retrieval have been proposed. In the context of gene expression experiments, most methods retrieve gene expression profiles, which requires each experiment to be expressed as a single profile, typically of case versus control. A more general, recently suggested alternative is to retrieve experiments whose models are good at modelling the query dataset. However, for very noisy and high-dimensional query data, this retrieval criterion turns out to be very noisy as well.

RESULTS: We propose doing retrieval using a denoised model of the query dataset instead of the original noisy dataset itself. To this end, we introduce a general probabilistic framework in which each experiment is modelled separately and retrieval is done by finding related models. For retrieval of gene expression experiments, we use a probabilistic model called the product partition model, which induces a clustering of genes that show similar expression patterns across a number of samples. The suggested metric for retrieval using clusterings is the normalized information distance. Finally, empirical results suggest that inference for the full probabilistic model can be approximated with good performance using computationally faster heuristic clustering approaches (e.g. k-means). The method is highly scalable and straightforward to apply in constructing a general-purpose gene expression experiment retrieval method.

AVAILABILITY AND IMPLEMENTATION: The method can be implemented using standard clustering algorithms and the normalized information distance, both available in many statistical software packages.

CONTACT: paul.blomstedt@aalto.fi or samuel.kaski@aalto.fi

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
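The retrieval step described above can be illustrated with a minimal sketch. It assumes that each experiment has already been reduced to a clustering of the same ordered set of genes (for example by k-means), and it uses one common definition of the normalized information distance between two clusterings U and V, namely 1 - I(U,V) / max(H(U), H(V)). The function names `entropy`, `mutual_information`, `nid` and `rank_experiments`, and the toy labels, are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a clustering given as a list of labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    """Mutual information (in nats) between two clusterings of the same genes."""
    n = len(u)
    count_u, count_v = Counter(u), Counter(v)
    joint = Counter(zip(u, v))
    return sum(
        (c / n) * log(c * n / (count_u[a] * count_v[b]))
        for (a, b), c in joint.items()
    )


def nid(u, v):
    """Normalized information distance, assumed here to be
    1 - I(U,V) / max(H(U), H(V)); 0 means identical partitions."""
    if len(u) != len(v):
        raise ValueError("clusterings must label the same genes")
    h_max = max(entropy(u), entropy(v))
    if h_max == 0.0:
        # Both clusterings put every gene in one cluster, so they coincide.
        return 0.0
    return 1.0 - mutual_information(u, v) / h_max


def rank_experiments(query, database):
    """Return experiment names ordered from most to least similar clustering."""
    return sorted(database, key=lambda name: nid(query, database[name]))


if __name__ == "__main__":
    # Toy gene-to-cluster labels; the experiment names are made up.
    query = [0, 0, 1, 1]
    database = {
        "exp_a": [0, 1, 0, 1],
        "exp_b": [1, 1, 0, 0],
        "exp_c": [0, 0, 0, 1],
    }
    for name in rank_experiments(query, database):
        print(name, round(nid(query, database[name]), 3))
```

Because the distance depends only on how genes are grouped, it is unaffected by relabelling the clusters, so clusterings produced independently for each experiment can be compared directly.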
