Automatic Ontology Extraction from Unstructured Texts

Construction of the ontology of a specific domain currently relies on the intuition of a knowledge engineer, and the typical output is a thesaurus of terms, each of which is expected to denote a concept. Ontological ‘engineers’ tend to hand-craft these thesauri on an ad hoc basis and on a relatively small scale. Workers in a specific domain create their own special language, and one device for this creation is the repetition of select keywords to consolidate or reject one or more concepts. A more scalable, systematic and automatic approach to ontology construction is possible through the automatic identification of these keywords. We outline an approach to the study and extraction of keywords in which a corpus of randomly collected unstructured texts (that is, texts containing no mark-up of any kind) in a specific domain is analysed with reference to the lexical preferences of the workers in the domain. An approximation of the role of frequently used single words within multiword expressions leads us to the creation of a semantic network. The network can be asserted into a terminology database or a knowledge representation formalism, and the relationships between the nodes of the network help in the visualisation of, and automatic inference over, the frequently used words that denote important concepts in the domain. We illustrate our approach with a case study using corpora from three time periods on the emergence and consolidation of nuclear physics. The text-based approach appears to be less subjective and more suitable for introspection, and is perhaps useful in ontology evolution.
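
The pipeline outlined above (frequent single words, their role inside multiword expressions, and a network built from both) can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions: the abstract does not specify the statistic used to select keywords, so raw frequency after stopword removal stands in for it here, and adjacent-word pairs stand in for multiword expressions. The function names, the stopword list and the toy sentences are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "of", "a", "an", "and", "in", "is", "to", "by", "with"}

def tokenize(text):
    """Lower-case the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def frequent_keywords(texts, top_n=3):
    """Return the top_n most frequent non-stopword tokens in the corpus.

    Raw frequency is an assumed stand-in for the keyword-selection step.
    """
    counts = Counter(
        tok for text in texts for tok in tokenize(text) if tok not in STOPWORDS
    )
    return [word for word, _ in counts.most_common(top_n)]

def collocation_edges(texts, keywords):
    """Count adjacent (modifier, keyword) pairs as candidate network edges."""
    keyword_set = set(keywords)
    edges = Counter()
    for text in texts:
        tokens = tokenize(text)
        for left, right in zip(tokens, tokens[1:]):
            if right in keyword_set and left not in STOPWORDS:
                edges[(left, right)] += 1
    return edges

# Toy corpus, invented for illustration only.
texts = [
    "The atomic nucleus contains protons and neutrons.",
    "Nuclear reaction rates depend on the nucleus.",
    "A heavy nucleus may undergo a nuclear reaction.",
    "The compound nucleus decays after the nuclear reaction.",
]

keywords = frequent_keywords(texts, top_n=3)
edges = collocation_edges(texts, keywords)

for (modifier, head), count in sorted(edges.items()):
    print(f"{modifier} -> {head}: {count}")
```

Each edge links a modifier to a frequent head word, so pairs such as "heavy nucleus" and "compound nucleus" cluster under the node "nucleus". A fuller treatment would presumably replace raw frequency with a comparison against general-language usage and would handle expressions longer than two words.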
