From Web Directories to Ontologies: Natural Language Processing Challenges

Hierarchical classifications are used pervasively by humans to organize their data and their knowledge about the world. One of their main advantages is that the natural language labels used to describe their contents are easily understood by human users. At the same time, this is also one of their main disadvantages, because these same labels are ambiguous and very hard for software agents to reason about. This is a major obstacle to embedding classifications in the Semantic Web infrastructure. This paper presents an approach to converting classifications into lightweight ontologies and makes the following contributions: (i) it identifies the main NLP problems related to the conversion process and shows how they differ from the classical problems of NLP; (ii) it proposes heuristic solutions to these problems, which are especially effective in this domain; and (iii) it evaluates the proposed solutions by testing them on DMoz data.
