Capturing Document Semantics for Ontology Generation and Document Summarization

When working with a document collection, it is important to identify repeated information. In multi-document summarization, for example, widely repeated content should be retained even when its wording differs across documents. Simple approaches look only for identical strings, or identical syntactic structures (including words), across documents. Here we investigate semantic matching, which applies background knowledge from a large, general knowledge base (KB) to identify such repeated information in texts. Automatic document summarization is the problem of creating a surrogate for a document that adequately represents its full content. Automatic ontology generation requires information about candidate types, roles, and relationships gathered from across a document or document collection. We aim for a summarization system that replicates the quality of human-written summaries, and for ontology creation systems that significantly reduce the human effort required for construction. Both applications depend for their success on extracting the essence of a collection of text. The work reported here demonstrates the utility of deep knowledge from Cyc for identifying redundant information in texts using both semantic and syntactic information.
