Connecting the Dots between PubMed Abstracts

Background: A multitude of articles, published across a wide range of journals, now provide information about genes, proteins, pathways, and diseases. Each article investigates a subset of a biological process, but to gain insight into the functioning of a system as a whole, we must integrate information from multiple publications. In particular, unraveling relationships between extracellular inputs and downstream molecular response mechanisms requires integrating conclusions from diverse publications.

Methodology: We present an automated approach to biological knowledge discovery from PubMed abstracts, suitable for “connecting the dots” across the literature. We describe a storytelling algorithm that, given a start and an end publication, typically with little or no overlap in content, identifies a chain of intermediate publications from one to the other such that neighboring publications have significant content similarity. The quality of discovered stories is measured using local criteria, such as the size of the supporting neighborhood for each link and the strength of the individual links connecting publications, as well as global metrics of dispersion. To ensure that a story stays coherent as it meanders from one publication to another, we describe novel coherence and overlap filters for use as post-processing steps.

Conclusions: We demonstrate the application of our storytelling algorithm to three case studies: i) a many-to-one study exploring relationships between multiple cellular inputs and a molecule responsible for cell-fate decisions, ii) a many-to-many study exploring the relationships between multiple cytokines and multiple downstream transcription factors, and iii) a one-to-one study showcasing the ability to recover a cancer-related association, viz. the Warburg effect, from past literature. The storytelling pipeline helps narrow a scientist's focus from several hundred thousand relevant documents to only around a hundred stories. We argue that our approach can serve as a valuable discovery aid for hypothesis generation and connection exploration in large unstructured biological knowledge bases.
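The chaining idea in the Methodology can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each abstract has already been reduced to a set of terms, swaps in plain Jaccard similarity with a fixed threshold as the link criterion, and uses breadth-first search to find a shortest chain. The names `jaccard`, `find_story`, and `min_sim`, along with the toy documents, are hypothetical, and the neighborhood-size, dispersion, coherence, and overlap criteria from the abstract are not modeled here.

```python
from collections import deque


def jaccard(a, b):
    """Jaccard similarity between two term sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_story(docs, start, end, min_sim=0.3):
    """Return a shortest chain of document ids from start to end in which
    every pair of neighbors has Jaccard similarity >= min_sim, or None
    if no such chain exists. `docs` maps document id -> set of terms."""
    if start == end:
        return [start]
    parent = {start: None}          # also serves as the visited set
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        for nxt in docs:
            if nxt in parent:
                continue
            if jaccard(docs[cur], docs[nxt]) >= min_sim:
                parent[nxt] = cur
                if nxt == end:
                    # Walk the parent links back to the start.
                    path = []
                    node = nxt
                    while node is not None:
                        path.append(node)
                        node = parent[node]
                    return path[::-1]
                queue.append(nxt)
    return None


# Toy corpus (made-up term sets, for illustration only).
docs = {
    "A": {"glucose", "insulin", "receptor"},
    "B": {"insulin", "receptor", "kinase"},
    "C": {"receptor", "kinase", "apoptosis"},
    "D": {"kinase", "apoptosis", "caspase"},
    "X": {"yeast", "desiccation"},
}

story = find_story(docs, "A", "D", min_sim=0.4)
print(story)
```

In this toy corpus, documents A and D share no terms, yet they are connected through B and C because each adjacent pair overlaps strongly. Raising `min_sim` demands stronger links at the cost of finding fewer chains.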
