A novel semantic information retrieval system based on a three-level domain model

This paper presents a methodology and a prototype for extracting and indexing knowledge from natural language documents. The underlying domain model relies on a conceptual level (described by means of a domain ontology), which represents the domain knowledge, and a lexical level (based on WordNet), which represents the domain vocabulary. A stochastic model, the ME-2L-HMM2, which combines HMM and maximum entropy models in a novel way, stores the mapping between these two levels, taking into account the linguistic context of words. This context contains not only the surrounding words but also morphological and syntactic information extracted using natural language processing tools. During the document indexing phase, the stochastic model is used to disambiguate word meanings. The semantic information retrieval engine we developed supports both simple keyword-based queries and natural language queries. The engine can also extend the domain knowledge by discovering new, relevant concepts to add to the domain model. Validation tests indicate that the system disambiguates and extracts concepts with good accuracy. A comparison between our prototype and a classic search engine shows that the proposed approach provides better accuracy.
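
To make the disambiguation step more concrete, the following is a minimal sketch of how a hidden Markov model can map a sequence of words (lexical level) to a sequence of concepts (conceptual level). It is not the ME-2L-HMM2: it assumes a plain first-order HMM decoded with the Viterbi algorithm and uses no maximum entropy component and no morphological or syntactic features. The function name `viterbi`, the concept names, and all probabilities are hypothetical values chosen for illustration only.

```python
import math

# Hypothetical toy model: concepts are hidden states, words are observations.
CONCEPTS = ["Money", "FinancialInstitution", "River", "Riverbank"]
START_P = {c: 0.25 for c in CONCEPTS}
# P(concept_t | concept_{t-1}); pairs not listed fall back to a small floor.
TRANS_P = {
    ("Money", "FinancialInstitution"): 0.8,
    ("Money", "Riverbank"): 0.2,
    ("River", "Riverbank"): 0.8,
    ("River", "FinancialInstitution"): 0.2,
}
# P(word | concept); the word "bank" is ambiguous between two concepts.
EMIT_P = {
    ("Money", "money"): 1.0,
    ("River", "river"): 1.0,
    ("FinancialInstitution", "bank"): 1.0,
    ("Riverbank", "bank"): 1.0,
}


def viterbi(words, concepts, start_p, trans_p, emit_p, floor=1e-9):
    """Return the most likely concept sequence for a non-empty word list."""

    def lp(table, key):
        # Log-probability lookup; unseen events get the floor probability.
        return math.log(table.get(key, floor))

    # best[c] = (log-probability, path) of the best sequence ending in c.
    best = {
        c: (lp(start_p, c) + lp(emit_p, (c, words[0])), [c]) for c in concepts
    }
    for word in words[1:]:
        step = {}
        for c in concepts:
            # Pick the predecessor concept that maximizes the path score.
            score, path = max(
                (best[p][0] + lp(trans_p, (p, c)), best[p][1]) for p in concepts
            )
            step[c] = (score + lp(emit_p, (c, word)), path + [c])
        best = step
    return max(best.values())[1]


if __name__ == "__main__":
    for sentence in (["money", "bank"], ["river", "bank"]):
        print(sentence, viterbi(sentence, CONCEPTS, START_P, TRANS_P, EMIT_P))
```

In this toy setting the preceding word alone decides which concept "bank" is mapped to. The model described in the abstract uses a richer linguistic context than this sketch does.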
