Automatic extraction of knowledge from web documents

Much of the digital information available today is written as text documents such as web pages, reports, papers, and emails. Extracting the knowledge of interest from such documents, across multiple sources and in a timely fashion, is therefore crucial. This paper provides an update on the Artequakt system, which uses natural language tools to automatically extract knowledge about artists from multiple documents based on a predefined ontology. The ontology represents the type and form of knowledge to extract. The extracted knowledge is then used to generate tailored biographies. The information extraction process of Artequakt is detailed and evaluated in this paper.
