Towards an obesity-cancer knowledge base: Biomedical entity identification and relation detection

Obesity is associated with increased risks of various types of cancer, as well as a wide range of other chronic diseases. At the same time, access to health information encourages patient participation and improves health outcomes. However, existing online information on obesity and its relationship to cancer is heterogeneous, ranging from pre-clinical models and case studies to mere hypothesis-based scientific arguments. A formal knowledge representation (i.e., a semantic knowledge base) would help to better organize and deliver the quality health information on obesity and cancer that consumers need. Nevertheless, current ontologies describing obesity, cancer and related entities are not designed to guide automatic knowledge base construction from heterogeneous information sources. Thus, in this paper, we present methods for named-entity recognition (NER) to extract biomedical entities from scholarly articles and for detecting whether two biomedical entities are related, with the long-term goal of building an obesity-cancer knowledge base. We leverage both linguistic and statistical approaches in the NER task, and the resulting method surpasses state-of-the-art results. Further, based on statistical features extracted from the sentences, our method for relation detection obtains an accuracy of 99.3% and an F-measure of 0.993.
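
For readers unfamiliar with the two metrics reported for relation detection, the sketch below shows how accuracy and F-measure are conventionally computed for a binary "related / not related" decision. It is an illustration only, not the evaluation code used in the paper: the function name `accuracy_and_f_measure` and the toy labels are assumptions, and the balanced F1 variant is assumed because the abstract does not say which F-measure was used.

```python
def accuracy_and_f_measure(gold, predicted):
    """Return (accuracy, F1) for binary labels, where True means 'related'.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("at least one labeled entity pair is required")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    tn = sum(1 for g, p in zip(gold, predicted) if not g and not p)

    accuracy = (tp + tn) / len(gold)
    # F1 is the harmonic mean of precision and recall,
    # which simplifies to 2*TP / (2*TP + FP + FN).
    denominator = 2 * tp + fp + fn
    f1 = (2 * tp / denominator) if denominator else 0.0
    return accuracy, f1


# Toy example: five candidate entity pairs, one related pair missed.
gold = [True, True, False, False, True]
predicted = [True, False, False, False, True]
acc, f1 = accuracy_and_f_measure(gold, predicted)
print(f"accuracy={acc:.3f}, F1={f1:.3f}")
```

Accuracy counts correct decisions over all candidate pairs, whereas the F-measure ignores true negatives, so the two can diverge sharply when unrelated pairs dominate the data.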
