Integrative mining of traditional Chinese medicine literature and MEDLINE for functional gene networks

OBJECTIVE The volume of biomedical data across disciplines is growing exponentially, and integrating these knowledge sources to generate novel hypotheses for systems biology research is difficult. Traditional Chinese medicine (TCM) is a distinct discipline and a knowledge system complementary to modern biomedical science. This paper uses a major TCM bibliographic literature database in China, together with MEDLINE, to help discover novel functional knowledge about genes.

MATERIALS AND METHODS We present an integrative mining approach to uncover functional gene relationships from MEDLINE and the TCM bibliographic literature. The TCM literature (about 50,000 records) serves as one knowledge source for constructing literature-based gene networks. We use the TCM diagnostic unit, the syndrome, to group related genes automatically. Syndrome-gene relationships are derived from the syndrome-disease relationships extracted from the TCM literature and the disease-gene relationships in MEDLINE. Based on bubble-bootstrapping and relation weight computing methods, we developed a prototype system, MeDisco/3S, which provides named entity and relation extraction and online analytical processing (OLAP) capabilities to perform the integrative mining process.

RESULTS We obtained about 200,000 syndrome-gene relations, which can help generate syndrome-based gene networks and analyze the functional knowledge of genes from a syndrome perspective. To demonstrate the preliminary results, we present the gene network of Kidney-Yang Deficiency (KYD) syndrome and a functional analysis of several genes, such as CRH (corticotropin releasing hormone), PTH (parathyroid hormone), PRL (prolactin), BRCA1 (breast cancer 1, early onset) and BRCA2 (breast cancer 2, early onset). The underlying hypothesis is that genes related to the same syndrome share biological functional relationships and constitute a functional network.

CONCLUSION This paper presents an approach that integrates TCM literature and modern biomedical data to discover novel gene networks and functional knowledge of genes. The preliminary results show that integrating the two complementary biomedical data sources can generate novel gene functional knowledge and gene networks that are worthy of further investigation. Integrative mining of TCM and modern life science literature is a promising research field.
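
The derivation of syndrome-gene relations from syndrome-disease and disease-gene relations can be illustrated with a minimal sketch. It is not the MeDisco/3S implementation: the abstract does not specify the relation weight computation, so the sum-of-products score below is an assumption, and the function names, identifiers, and toy counts are all hypothetical placeholders.

```python
from collections import defaultdict


def infer_syndrome_gene(syndrome_disease, disease_gene):
    """Join two weighted relation tables on their shared disease term.

    Both inputs map a (left, right) pair to a weight, e.g. a
    co-occurrence count. The combined score (sum over shared diseases
    of the product of the two weights) is an assumed scheme.
    """
    genes_by_disease = defaultdict(list)
    for (disease, gene), weight in disease_gene.items():
        genes_by_disease[disease].append((gene, weight))

    scores = defaultdict(float)
    for (syndrome, disease), w_sd in syndrome_disease.items():
        for gene, w_dg in genes_by_disease.get(disease, []):
            scores[(syndrome, gene)] += w_sd * w_dg
    return dict(scores)


def gene_network(syndrome_gene, syndrome):
    """Link every pair of genes related to the same syndrome."""
    genes = sorted(g for (s, g) in syndrome_gene if s == syndrome)
    return [(a, b) for i, a in enumerate(genes) for b in genes[i + 1:]]


# Toy input with placeholder identifiers (not real associations).
syndrome_disease = {
    ("syndrome_X", "disease_A"): 3,
    ("syndrome_X", "disease_B"): 1,
    ("syndrome_Y", "disease_B"): 2,
}
disease_gene = {
    ("disease_A", "GENE1"): 2,
    ("disease_A", "GENE2"): 1,
    ("disease_B", "GENE2"): 4,
    ("disease_B", "GENE3"): 1,
}

relations = infer_syndrome_gene(syndrome_disease, disease_gene)
network_x = gene_network(relations, "syndrome_X")
```

Here `relations` holds one score per inferred syndrome-gene pair, and `network_x` lists the gene pairs that share `syndrome_X`, reflecting the hypothesis that genes related to the same syndrome form a functional network.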
