Literature-aided interpretation of gene expression data with the weighted global test

Most methods for the interpretation of gene expression profiling experiments rely on the categorization of genes, as provided by the Gene Ontology (GO) and pathway databases. Because they are curated manually, such databases are never fully up to date and tend to be limited in focus and coverage. Automated literature mining tools provide an attractive alternative approach. We review how they can be employed for the interpretation of gene expression profiling experiments. We illustrate that their comprehensive scope aids the interpretation of data from domains poorly covered by GO or alternative databases, and allows gene expression to be linked with diseases, drugs, tissues and other types of concepts. A framework for proper statistical evaluation of the associations between gene expression values and literature concepts was lacking and is now implemented in a weighted extension of the global test. The weights are the literature association scores and reflect the importance of a gene for the concept of interest. In a direct comparison with classical GO-based gene sets, we show that the use of literature-based associations results in the identification of much more specific GO categories. We demonstrate the possibilities for linking gene expression data to patient survival in breast cancer and to the action and metabolism of drugs. Coupling with online literature mining tools ensures transparency and allows further study of the identified associations. Literature mining tools are therefore powerful additions to the toolbox for the interpretation of high-throughput genomics data.
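To make the idea of a weighted gene-set statistic concrete, the following is a minimal sketch and not the implementation described in this paper. It assumes a simple score-type statistic in which each gene contributes the squared covariance-like score between its centered expression values and the centered outcome, multiplied by its literature association weight, and it assesses significance by permuting the outcome. The function names `weighted_global_statistic` and `permutation_p_value`, the normalization by the sum of weights, and the choice of applying weights linearly to the squared scores are all illustrative assumptions.

```python
import random


def weighted_global_statistic(expr, outcome, weights):
    """Weighted score-type statistic for one concept (illustrative assumption).

    expr:    list of samples, each a list of expression values (one per gene)
    outcome: list of numeric outcomes, one per sample
    weights: non-negative literature association scores, one per gene
    """
    n = len(outcome)
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("at least one gene needs a positive weight")
    y_mean = sum(outcome) / n
    residuals = [y - y_mean for y in outcome]
    q = 0.0
    for j, w in enumerate(weights):
        column = [expr[i][j] for i in range(n)]
        col_mean = sum(column) / n
        # Score of gene j: inner product of centered expression and residuals.
        score = sum((column[i] - col_mean) * residuals[i] for i in range(n))
        q += w * score ** 2
    return q / total_weight


def permutation_p_value(expr, outcome, weights, n_perm=1000, seed=0):
    """Permutation p-value obtained by shuffling the outcome labels."""
    rng = random.Random(seed)
    observed = weighted_global_statistic(expr, outcome, weights)
    shuffled = list(outcome)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if weighted_global_statistic(expr, shuffled, weights) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Toy data: gene 0 tracks the outcome, genes 1 and 2 do not.
toy_expr = [
    [1, 5, 1], [1, 4, 2], [1, 5, 1],
    [3, 4, 2], [3, 5, 1], [3, 4, 2],
]
toy_outcome = [0, 0, 0, 1, 1, 1]
q_informative = weighted_global_statistic(toy_expr, toy_outcome, [1.0, 0.0, 0.0])
q_noise = weighted_global_statistic(toy_expr, toy_outcome, [0.0, 0.0, 1.0])
p_informative = permutation_p_value(toy_expr, toy_outcome, [1.0, 0.0, 0.0])
```

In this toy example, putting all weight on the gene that tracks the outcome yields a larger statistic than putting it on an unrelated gene, which is the behaviour one would want from weights that reflect the importance of a gene for the concept of interest. With only six samples the permutation p-value cannot become very small, so the example illustrates the mechanics rather than a realistic analysis.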
