Extracting Disease-Symptom Relationships by Learning Syntactic Patterns from Dependency Graphs

Disease-symptom relationships are of primary importance for biomedical informatics, but the databases that catalog them are incomplete compared with the knowledge available in the scientific literature. In this paper, we propose a novel method for automatically extracting disease-symptom relationships from text, called SPARE (Syntactic PAttern for Relationship Extraction). The method consists of three successive steps: first, we learn patterns from the dependency graphs; second, we select the best patterns based on their respective quality and specificity (their ability to identify only disease-symptom relationships); finally, the selected patterns are applied to new texts to extract disease-symptom relationships. We evaluated SPARE on a corpus of 121,796 PubMed abstracts related to 457 rare diseases. The quality of the extraction was evaluated as a function of pattern quality and specificity. The best F-measure obtained is 55.65% (for specificity 0.5 and quality 0.5). To provide insight into the novelty of the extracted disease-symptom relationships, we compare our results to the content of phenotype databases (OrphaData and OMIM). Our results show the feasibility of automatically extracting disease-symptom relationships, including true relationships that were not already referenced in phenotype databases and that may involve complex symptom descriptions.
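
The three steps can be illustrated with a minimal Python sketch. It is not the SPARE implementation: the abstract does not define how patterns are represented or how quality and specificity are computed. The sketch assumes that a pattern is the sequence of dependency labels on the shortest path between a disease mention and a symptom mention, and that each pattern already carries a quality score and a specificity score in [0, 1]. The function names, the hand-written dependency parse, and the toy scores are all hypothetical.

```python
from collections import deque


def shortest_path_pattern(edges, source, target):
    """Return the dependency labels on the shortest path from source to target.

    edges: (head, label, dependent) triples for one sentence.
    A step is written '>label' when it goes from head to dependent
    and '<label' when it goes from dependent to head.
    Assumption: a "pattern" is this label sequence; SPARE may differ.
    """
    neighbours = {}
    for head, label, dep in edges:
        neighbours.setdefault(head, []).append((dep, ">" + label))
        neighbours.setdefault(dep, []).append((head, "<" + label))
    queue = deque([(source, ())])
    seen = {source}
    while queue:  # breadth-first search over the undirected graph
        node, path = queue.popleft()
        if node == target:
            return path
        for nxt, step in neighbours.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + (step,)))
    return None  # the two mentions are not connected


def select_patterns(stats, min_quality, min_specificity):
    """Step 2: keep patterns whose (quality, specificity) reach both thresholds."""
    return {p for p, (q, s) in stats.items()
            if q >= min_quality and s >= min_specificity}


def extract(sentences, patterns):
    """Step 3: report (disease, symptom) pairs whose path matches a kept pattern."""
    return [(disease, symptom)
            for edges, disease, symptom in sentences
            if shortest_path_pattern(edges, disease, symptom) in patterns]


# Hand-written parse of "Fever is a symptom of malaria" (hypothetical example).
edges = [
    ("symptom", "nsubj", "fever"),
    ("symptom", "cop", "is"),
    ("symptom", "det", "a"),
    ("symptom", "prep_of", "malaria"),
]

# Step 1: learn a pattern from an annotated sentence.
pattern = shortest_path_pattern(edges, "malaria", "fever")

# Step 2: toy scores (made up for illustration), thresholds of 0.5 each.
stats = {pattern: (0.8, 0.6), ("<conj_and",): (0.3, 0.9)}
selected = select_patterns(stats, min_quality=0.5, min_specificity=0.5)

# Step 3: apply the selected patterns to a (here identical) new sentence.
relations = extract([(edges, "malaria", "fever")], selected)
print(pattern, relations)
```

Keeping the pattern as a plain tuple of labels makes it hashable, so selection and matching reduce to set membership; a real system would also need entity recognition and a dependency parser to produce the inputs.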
