On Building a Quantitative Food-Disease-Gene Network

Nutritional genomics is a new science that studies the relationships among foods (or nutrients), diseases, and genes. A large body of scientific findings has been published in this area, primarily as unstructured text. Moreover, for a given pair of entities, different studies can report different findings. It is hence important to obtain a holistic view of the reported relationships. In this article, we describe an information extraction system that aims to reach this goal. The system integrates natural language processing techniques, domain ontology, and statistical and machine learning methods. It consists of four main modules: (1) entity extraction, which recognizes and extracts five types of entities: foods, chemicals (or nutrients), diseases, proteins, and genes; (2) relationship extraction, which extracts binary relationships between entities; (3) relationship polarity analysis, which categorizes relationships into three groups: positive, negative, and neutral; and (4) strength analysis, which rates a relationship as weak, medium, or strong. To the best of our knowledge, we are the first to propose analyzing the polarity and strength of a binary relationship. We have evaluated our system using the GENIA corpus and datasets drawn from the MEDLINE database. The first two modules outperform the best reported results, with average F-scores of 0.89 and 0.82, respectively, while the last two also achieve promising results, with accuracies of 0.75-0.84 and ~0.90, respectively.

1 INTRODUCTION

Advances in biotechnology and life sciences are leading to an ever-increasing volume of published research data, predominantly in unstructured text (or natural language). At the time of writing, the MEDLINE database contains 19 million scientific articles, with a growth rate of ~400,000 articles per year [8].
This phenomenon is even more apparent in nutritional genomics, an emerging science that studies the relationships among foods (or nutrients), diseases, and genes [16]. For instance, soy products and green tea have been two of the most intensively studied foods in this new discipline because of their controversial relationship with cancer. A search of the MEDLINE database for "soy and cancer" returns a total of 1,287 articles, and a search for "green tea and cancer" returns 1,318 articles. Given the large number of publications every year, it is unrealistic for even the most motivated reader to go through these articles manually to obtain a full picture of the findings reported to date. Obtaining such a picture, however, has become ever more important and necessary for the following reasons: (1) given a …
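
As a rough illustration of the kind of record the four modules in the abstract would produce, the sketch below models an extracted relationship with its polarity and strength labels, and tallies polarity per entity pair to give a simple "holistic view" across studies. It is not the system described in this article: the class names, the `polarity_counts` helper, and the choice to aggregate by counting are all assumptions made for illustration. Only the five entity types, three polarity groups, and three strength levels come from the text.

```python
from collections import Counter
from dataclasses import dataclass

# Label sets taken from the abstract.
ENTITY_TYPES = {"food", "chemical", "disease", "protein", "gene"}
POLARITIES = {"positive", "negative", "neutral"}
STRENGTHS = {"weak", "medium", "strong"}


@dataclass(frozen=True)
class Entity:
    """Hypothetical container for one extracted entity."""
    name: str
    type: str  # one of ENTITY_TYPES

    def __post_init__(self):
        if self.type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {self.type}")


@dataclass(frozen=True)
class Relationship:
    """Hypothetical container for one reported binary relationship."""
    source: Entity
    target: Entity
    polarity: str  # one of POLARITIES
    strength: str  # one of STRENGTHS

    def __post_init__(self):
        if self.polarity not in POLARITIES:
            raise ValueError(f"unknown polarity: {self.polarity}")
        if self.strength not in STRENGTHS:
            raise ValueError(f"unknown strength: {self.strength}")


def polarity_counts(relationships):
    """Tally polarity labels per (source, target) name pair.

    Counting is an assumed aggregation, chosen only to show how
    conflicting findings about the same pair could be summarized.
    """
    counts = {}
    for rel in relationships:
        key = (rel.source.name, rel.target.name)
        counts.setdefault(key, Counter())[rel.polarity] += 1
    return counts


# Toy input with invented labels; these are not findings from any study.
soy = Entity("soy", "food")
cancer = Entity("cancer", "disease")
toy_findings = [
    Relationship(soy, cancer, "positive", "weak"),
    Relationship(soy, cancer, "negative", "strong"),
    Relationship(soy, cancer, "negative", "medium"),
]
summary = polarity_counts(toy_findings)
```

Keeping polarity and strength as separate fields mirrors the separation of modules (3) and (4) in the abstract; a real system would likely also carry the source sentence and article identifier with each record.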

[1] D. Swanson. Fish Oil, Raynaud's Syndrome, and Undiscovered Public Knowledge, 2015, Perspectives in biology and medicine.

[2] Beth Levin, et al. English Verb Classes and Alternations: A Preliminary Investigation, 1993.

[3] Thorsten Joachims, et al. Text Categorization with Support Vector Machines: Learning with Many Relevant Features, 1998, ECML.

[4] Alan R. Aronson, et al. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program, 2001, AMIA.

[5] Bo Pang, et al. Thumbs up? Sentiment Classification using Machine Learning Techniques, 2002, EMNLP.

[6] Peter D. Turney. Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews, 2002, ACL.

[7] David M. Pennock, et al. Mining the peanut gallery: opinion extraction and semantic classification of product reviews, 2003, WWW '03.

[8] Jun'ichi Tsujii, et al. GENIA corpus - a semantically annotated corpus for bio-textmining, 2003, ISMB.

[9] Jian Su, et al. Recognizing Names in Biomedical Texts: a Machine Learning Approach, 2004.

[10] R. Rodriguez, et al. Nutritional genomics: the next frontier in the postgenomic era, 2003.

[11] Nigel Collier, et al. Introduction to the Bio-entity Recognition Task at JNLPBA, 2004, NLPBA/BioNLP.

[12] Janyce Wiebe, et al. Just How Mad Are You? Finding Strong and Weak Opinion Clauses, 2004, AAAI.

[13] J. Berman. Pathology abbreviated: a long review of short terms, 2009, Archives of pathology & laboratory medicine.

[14] Andre Skusa, et al. Extraction of biological interaction networks from scientific literature, 2005, Briefings Bioinform.

[15] Bo Pang, et al. Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales, 2005, ACL.

[16] William R. Hersh, et al. A Survey of Current Work in Biomedical Text Mining, 2005.

[17] Catherine Sauvaget, et al. Lifestyle Factors, Radiation and Gastric Cancer in Atomic-Bomb Survivors (Japan), 2005, Cancer Causes & Control.

[18] G. Maskarinec, et al. Urinary Sex Steroid Excretion Levels During a Soy Intervention Among Young Girls: A Pilot Study, 2005, Nutrition and cancer.

[19] Sophia Ananiadou, et al. Text Mining for Biology And Biomedicine, 2005.

[20] G. Sonn, et al. Impact of diet on prostate cancer: a review, 2005, Prostate Cancer and Prostatic Diseases.

[21] U Kragl, et al. Flax-seed extracts with phytoestrogenic effects on a hormone receptor-positive tumour cell line, 2005, Anticancer research.

[22] Bing Liu, et al. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, 2006, Data-Centric Systems and Applications.

[23] Ronen Feldman, et al. Mining biomedical literature using information extraction.