Identifying gene-disease associations using centrality on a literature mined gene-interaction network

Motivation: Understanding the role of genetics in diseases is one of the most important aims of the biological sciences. The completion of the Human Genome Project has led to a rapid increase in the number of publications in this area. However, the coverage of curated databases that provide information manually extracted from the literature is limited. Another challenge is that determining disease-related genes requires laborious experiments. Therefore, predicting good candidate genes before experimental analysis will save time and effort. We introduce an automatic approach based on text mining and network analysis to predict gene-disease associations. We collected an initial set of known disease-related genes and built an interaction network by automatic literature mining based on dependency parsing and support vector machines. Our hypothesis is that the central genes in this disease-specific network are likely to be related to the disease. We used the degree, eigenvector, betweenness and closeness centrality metrics to rank the genes in the network.

Results: The proposed approach can be used to extract known gene-disease associations and to infer unknown ones. We evaluated the approach for prostate cancer. Eigenvector and degree centrality achieved high accuracy: 95% of the top 20 genes ranked by these methods are confirmed to be related to prostate cancer. Betweenness and closeness centrality, on the other hand, predicted more genes whose relation to the disease is currently unknown, which makes them candidates for experimental study.

Availability: A web-based system for browsing the disease-specific gene-interaction networks is available at: http://gin.ncibi.org

Contact: radev@umich.edu
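The ranking step described in the abstract (scoring each gene in the network by degree, eigenvector, betweenness and closeness centrality) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the gene names and edges are hypothetical placeholders, the graph is treated as undirected and unweighted, and every function name is introduced here for illustration only. Eigenvector centrality is approximated by power iteration on the adjacency matrix plus the identity, a common way to avoid oscillation on bipartite graphs without changing the resulting ordering.

```python
from collections import deque

# Hypothetical toy network: undirected edges between placeholder gene names.
EDGES = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]

def build_graph(edges):
    """Return an adjacency map {gene: set of neighbouring genes}."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph

def degree_centrality(graph):
    """Fraction of the other genes that each gene interacts with."""
    n = len(graph)
    return {v: len(nbrs) / (n - 1) for v, nbrs in graph.items()}

def closeness_centrality(graph):
    """Reachable genes divided by the sum of shortest-path distances."""
    scores = {}
    for s in graph:
        dist, queue = {s: 0}, deque([s])
        while queue:  # breadth-first search from s
            u = queue.popleft()
            for w in graph[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total = sum(dist.values())
        scores[s] = (len(dist) - 1) / total if total else 0.0
    return scores

def betweenness_centrality(graph):
    """Unnormalized shortest-path betweenness (Brandes' algorithm)."""
    scores = {v: 0.0 for v in graph}
    for s in graph:
        stack, preds = [], {v: [] for v in graph}
        sigma = {v: 0 for v in graph}  # number of shortest paths from s
        sigma[s] = 1
        dist, queue = {s: 0}, deque([s])
        while queue:
            u = queue.popleft()
            stack.append(u)
            for w in graph[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        delta = {v: 0.0 for v in graph}
        while stack:  # accumulate dependencies, farthest genes first
            w = stack.pop()
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                scores[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: b / 2 for v, b in scores.items()}

def eigenvector_centrality(graph, iterations=100):
    """Power iteration on A + I, rescaled so the largest score is 1."""
    scores = {v: 1.0 for v in graph}
    for _ in range(iterations):
        new = {v: scores[v] + sum(scores[w] for w in graph[v]) for v in graph}
        norm = max(new.values())
        scores = {v: x / norm for v, x in new.items()}
    return scores

def rank(scores, top=3):
    """Genes sorted by descending score; ties are broken by name."""
    return sorted(scores, key=lambda v: (-scores[v], v))[:top]

if __name__ == "__main__":
    g = build_graph(EDGES)
    for name, fn in [("degree", degree_centrality),
                     ("eigenvector", eigenvector_centrality),
                     ("betweenness", betweenness_centrality),
                     ("closeness", closeness_centrality)]:
        print(name, rank(fn(g)))
```

Even on this toy graph the metrics need not agree on the ordering below the top gene: gene D bridges to the peripheral gene E, so it scores higher on betweenness and closeness than its degree alone would suggest. This loosely mirrors the abstract's observation that the different centrality metrics surface different candidate genes.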
