Terminologies for text-mining; an experiment in the lipoprotein metabolism domain

Background: The engineering of ontologies, especially with a view to text-mining use, is still a new research field. There is not yet a well-defined theory or technology for ontology construction. Many of the ontology design steps remain manual and rely on personal experience and intuition. However, a few efforts exist on the automatic construction of ontologies in the form of extracted lists of terms and the relations between them.

Results: We share experience acquired during the manual development of a lipoprotein metabolism ontology (LMO) to be used for text-mining. We compare the manually created ontology terms with the terminology derived automatically by four different automatic term recognition (ATR) methods. The top 50 predicted terms contain up to 89% relevant terms. For the top 1000 terms, the best method still generates 51% relevant terms. A corpus of 3066 documents contains 53% of the LMO terms, and 38% can be generated with one of the methods.

Conclusions: Given their high precision, automatic methods can help decrease development time and provide significant support for the identification of domain-specific vocabulary. The coverage of the domain vocabulary depends strongly on the underlying documents. Ontology development for text mining should be performed in a semi-automatic way, taking ATR results as input and following the guidelines we described.

Availability: The TFIDF term recognition is available as a Web Service, described at http://gopubmed4.biotec.tu-dresden.de/IdavollWebService/services/CandidateTermGeneratorService?wsdl
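To make the evaluation figures above concrete, the following is a minimal sketch of how a TFIDF-style term ranking and the share of relevant terms among the top k candidates could be computed. It is not the authors' implementation or the Web Service's interface: the function names, the single-token candidate terms, the smoothed IDF formula, and the toy corpora are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_term_scores(domain_docs, background_docs):
    """Score tokens by domain frequency times a smoothed background IDF.

    Both arguments are lists of tokenized documents (lists of strings).
    The smoothing (+1 in numerator and denominator) is an assumption.
    """
    # Term frequency over the whole domain corpus.
    tf = Counter(tok for doc in domain_docs for tok in doc)
    # Document frequency in the background corpus.
    df = Counter(tok for doc in background_docs for tok in set(doc))
    n = len(background_docs)
    return {t: f * math.log((n + 1) / (df[t] + 1)) for t, f in tf.items()}


def precision_at_k(ranked_terms, relevant_terms, k):
    """Fraction of the top-k ranked terms that are judged relevant."""
    top = ranked_terms[:k]
    if not top:
        return 0.0
    return sum(t in relevant_terms for t in top) / len(top)


# Toy data (hypothetical, not from the paper's corpus).
domain = [["ldl", "cholesterol", "the"], ["ldl", "receptor", "the"]]
background = [["the", "cell"], ["the", "protein"], ["the", "receptor"]]

scores = tfidf_term_scores(domain, background)
ranked = sorted(scores, key=scores.get, reverse=True)
relevant = {"ldl", "cholesterol", "receptor"}

print(ranked)
print(precision_at_k(ranked, relevant, 2))
print(precision_at_k(ranked, relevant, 4))
```

In this sketch a manually curated term list plays the role of `relevant`, so the same `precision_at_k` call with k = 50 or k = 1000 would correspond to the kind of figure reported in the Results.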
