Formative evaluation of ontology learning methods for entity discovery by using existing ontologies as reference standards.

OBJECTIVE To develop a two-step method for the formative evaluation of statistical Ontology Learning (OL) algorithms that leverages existing biomedical ontologies as reference standards.

METHODS In the first step, optimum parameters are established. A 'gap list' of entities is generated by finding the set of entities that are present in a later version of the ontology but absent from an earlier version. A named entity recognition system then identifies which 'gap list' entities occur in a corpus of biomedical documents, and these entities form the reference standard. The output of the OL algorithm (new entity candidates produced by statistical methods) is subsequently compared against this reference standard. An OL method that performs perfectly would learn all of the terms in the reference standard. Using evaluation metrics and precision-recall curves for different thresholds and parameters, we compute the optimum parameters for each method. In the second step, human judges with expertise in ontology development evaluate each candidate suggested by the algorithm when it is configured with the optimum parameters established in the first step. These judgments are used to compute two performance metrics developed in our previous work: Entity Suggestion Rate (ESR) and Entity Acceptance Rate (EAR).

RESULTS Using this method, we evaluated two statistical OL methods in two medical domains. For the pathology domain, we obtained 49% ESR and 28% EAR with the Lin method, and 52% ESR and 39% EAR with the Church method. For the radiology domain, we obtained 87% ESR and 9% EAR with the Lin method, and 96% ESR and 16% EAR with the Church method.

CONCLUSION This method is general and flexible enough to permit comparison of any OL methods for a specific corpus and ontology of interest.
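The first step described in METHODS can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the representation of ontologies and corpus mentions as plain sets of strings, the candidate scores, and the use of F1 to pick the "optimum" threshold are all assumptions made for illustration. The abstract states only that evaluation metrics and precision-recall curves are used.

```python
from typing import Dict, List, Set, Tuple


def gap_list(earlier: Set[str], later: Set[str]) -> Set[str]:
    """Entities present in the later ontology version but not in the earlier one."""
    return later - earlier


def reference_standard(gap: Set[str], corpus_entities: Set[str]) -> Set[str]:
    """Gap-list entities that a (hypothetical) NER system found in the corpus."""
    return gap & corpus_entities


def precision_recall(candidates: Set[str], reference: Set[str]) -> Tuple[float, float]:
    """Precision and recall of the candidate set against the reference standard."""
    true_positives = len(candidates & reference)
    precision = true_positives / len(candidates) if candidates else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


def sweep_thresholds(
    scores: Dict[str, float], reference: Set[str], thresholds: List[float]
) -> List[Tuple[float, float, float]]:
    """One (threshold, precision, recall) point per threshold."""
    curve = []
    for t in thresholds:
        kept = {term for term, score in scores.items() if score >= t}
        p, r = precision_recall(kept, reference)
        curve.append((t, p, r))
    return curve


def best_threshold(curve: List[Tuple[float, float, float]]) -> float:
    """Pick the threshold with the highest F1 (an assumed selection criterion)."""
    def f1(p: float, r: float) -> float:
        return 2 * p * r / (p + r) if (p + r) else 0.0
    return max(curve, key=lambda point: f1(point[1], point[2]))[0]


# Toy data, invented purely for illustration.
earlier_version = {"carcinoma", "nevus"}
later_version = {"carcinoma", "nevus", "melanoma in situ", "spitz nevus", "lentigo"}
corpus_mentions = {"carcinoma", "melanoma in situ", "spitz nevus", "biopsy"}
candidate_scores = {
    "melanoma in situ": 0.9,
    "spitz nevus": 0.6,
    "biopsy": 0.5,
    "margin": 0.2,
}

gap = gap_list(earlier_version, later_version)
reference = reference_standard(gap, corpus_mentions)
curve = sweep_thresholds(candidate_scores, reference, [0.1, 0.55, 0.8])
print(best_threshold(curve))
```

In this toy setting "lentigo" is in the gap list but never appears in the corpus, so it is excluded from the reference standard; an OL method cannot be expected to learn a term that the corpus does not contain. The second step (ESR and EAR from human judgments) is not sketched here because the abstract does not give the definitions of those metrics.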
