Querying Genomic Databases: Refining the Connectivity Map

The advent of high-throughput biotechnologies, which can efficiently measure gene expression on a global basis, has led to the creation and population of correspondingly rich databases and compendia. Such repositories have the potential to add enormous scientific value beyond that provided by individual studies which, due largely to cost considerations, are typified by small sample sizes. Accordingly, substantial effort has been invested in devising analysis schemes for utilizing gene-expression repositories. Here, we focus on one such scheme, the Connectivity Map (cmap), which was developed with the express purpose of identifying drugs with putative efficacy against a given disease, where the disease in question is characterized by a (differential) gene-expression signature. Initial claims surrounding cmap intimated that such tools might lead to new, previously unanticipated applications of existing drugs. However, further application suggests that its primary utility is in connecting a disease condition whose biology is largely unknown to a drug whose mechanisms of action are well understood, making cmap a tool for enhancing biological knowledge.

The success of the Connectivity Map belies its simplicity. The aforementioned signature serves as an unordered query that is applied to a customized database of (differential) gene-expression experiments designed to elicit response to a wide range of drugs, across a spectrum of concentrations, durations, and cell lines. Such application is effected by computing a per-experiment score that measures "closeness" between the signature and the experiment. Top-scoring experiments, and the attendant drug(s), are then deemed relevant to the disease underlying the query. Inference supporting such elicitations is pursued via re-sampling.

In this paper, we revisit two key aspects of the Connectivity Map implementation. Firstly, we develop new approaches to measuring closeness for the common scenario wherein the query constitutes an ordered list. These approaches use metrics proposed for analyzing partially ranked data, which are of interest in their own right and not widely used. Secondly, we advance an alternate inferential approach based on generating empirical null distributions that exploit the scope, and capture dependencies, embodied by the database. Using these refinements, we undertake a comprehensive re-evaluation of Connectivity Map findings which, in general terms, reveals that accommodating ordered queries is less critical than the mode of inference.
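
To make the two refinements concrete, the following is a minimal sketch under stated assumptions, not the method developed in the paper. It assumes each database experiment is stored as a list of genes ordered from most up-regulated to most down-regulated, scores an ordered query with a Spearman footrule distance on the induced ranking (a simple stand-in for the partially ranked metrics referred to above), and summarizes each experiment against the empirical distribution of scores across the whole database (a crude stand-in for an empirical null). All names (`footrule`, `empirical_tail`, `database`, the gene and drug labels) are hypothetical.

```python
def induced_ranks(query, profile):
    """Rank the query genes 0..k-1 by where they fall in the experiment's profile."""
    position = {gene: i for i, gene in enumerate(profile)}
    ordered = sorted(query, key=lambda gene: position[gene])
    return {gene: rank for rank, gene in enumerate(ordered)}


def footrule(query, profile):
    """Spearman footrule between the query order and its induced order.

    Assumption: smaller values mean the experiment is 'closer' to the query.
    """
    ranks = induced_ranks(query, profile)
    return sum(abs(i - ranks[gene]) for i, gene in enumerate(query))


def empirical_tail(scores):
    """Fraction of database scores that are <= each experiment's own score."""
    values = list(scores.values())
    return {
        name: sum(v <= s for v in values) / len(values)
        for name, s in scores.items()
    }


# Toy database: hypothetical experiments, each a full gene ordering.
database = {
    "drugA": ["g1", "g2", "g3", "g4", "g5"],
    "drugB": ["g5", "g4", "g3", "g2", "g1"],
    "drugC": ["g2", "g1", "g3", "g5", "g4"],
}
query = ["g1", "g3", "g5"]  # ordered signature (hypothetical)

scores = {name: footrule(query, profile) for name, profile in database.items()}
tails = empirical_tail(scores)
```

In this toy setting, experiments whose ordering agrees with the query receive a footrule score of zero, and the tail fractions are computed from the database itself rather than from permuted gene labels, which is the sense in which the reference distribution reflects the database's scope.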
