Computer-assisted curation of a human regulatory core network from the biological literature

MOTIVATION: A highly interlinked network of transcription factors (TFs) orchestrates the context-dependent expression of human genes. ChIP-chip experiments that interrogate the binding of particular TFs to genomic regions are used to reconstruct gene regulatory networks at genome scale, but are plagued by high false-positive rates. Meanwhile, a large body of knowledge on high-quality regulatory interactions remains largely unexplored, as it is available only in natural language descriptions scattered over millions of scientific publications. Such data are hard to extract, and regulatory databases together currently contain only 503 regulatory relations between human TFs.

RESULTS: We developed a text-mining-assisted workflow to systematically extract knowledge about regulatory interactions between human TFs from the biological literature. We applied this workflow to the entire Medline, which helped us to identify more than 45 000 sentences potentially describing such relationships. We ranked these sentences by a machine-learning approach. Of the top 2500 sentences, ∼900 described relations already known in databases. By manually curating the remaining 1625 top-ranking sentences, we obtained more than 300 validated regulatory relationships that were not previously present in any regulatory database. Full-text curation allowed us to obtain detailed information on the strength of the experimental evidence supporting each relationship.

CONCLUSIONS: We were able to increase curated information about the human core transcriptional network by >60% compared with the current content of regulatory databases. We observed improved performance when using the network for disease gene prioritization compared with the state of the art.

AVAILABILITY AND IMPLEMENTATION: The web service is freely accessible at http://fastforward.sys-bio.net/.

CONTACT: leser@informatik.hu-berlin.de or nils.bluethgen@charite.de

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
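The triage step described in the results (rank candidate sentences, keep the top-ranking ones, set aside those whose relation is already in a database, and pass the rest to curators) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Sentence` record, the `triage` function, the score scale and the placeholder TF names (`TF_A`, `TF_B`, ...) are all assumptions made for illustration, and the classifier that produces the scores is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sentence:
    """A candidate sentence with one extracted TF-TF pair (hypothetical record)."""
    text: str
    regulator: str  # TF reported as the regulator
    target: str     # TF reported as the regulated gene
    score: float    # classifier confidence; the scale is an assumption


def triage(sentences, known_relations, top_k):
    """Rank sentences by score, keep the top_k, and split them into
    relations already in a database and relations that need curation."""
    ranked = sorted(sentences, key=lambda s: s.score, reverse=True)[:top_k]
    known, to_curate = [], []
    for s in ranked:
        if (s.regulator, s.target) in known_relations:
            known.append(s)
        else:
            to_curate.append(s)
    return known, to_curate


# Toy input: placeholder names only, not real regulatory relations.
toy = [
    Sentence("TF_A activates TF_B.", "TF_A", "TF_B", 0.95),
    Sentence("TF_C represses TF_D.", "TF_C", "TF_D", 0.80),
    Sentence("TF_A and TF_E were both measured.", "TF_A", "TF_E", 0.10),
    Sentence("TF_B induces TF_C expression.", "TF_B", "TF_C", 0.70),
]
known_db = {("TF_A", "TF_B")}  # stand-in for existing database content

known, to_curate = triage(toy, known_db, top_k=3)
for s in to_curate:
    print(f"curate: {s.regulator} -> {s.target} ({s.score:.2f})")
```

In this sketch, a relation is treated as known only if the exact ordered pair (regulator, target) is in the database set; whether the actual workflow matched relations this way is not stated in the abstract.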
