Towards Ontology-based Information Extraction and Annotation of Paper Documents for Personalized Knowledge Acquisition

Despite the advent of electronic personal information management (PIM) tools, knowledge workers still rely heavily on paper-based information sources. Yet even sophisticated PIM tools such as the Semantic Desktop still neglect the knowledge workers’ paper world. As a result, electronically archiving a web page for later reference is much easier than keeping track of an interesting magazine article, whose copy may gather dust on the user’s shelf and be long forgotten by the time it would be helpful for a specific task. This paper presents how document analysis, ontology-based information extraction, and annotation techniques can be used for personal knowledge acquisition in order to bridge the gap between the user’s paper world and their personal knowledge space in the Semantic Desktop. A recent prototype shows the feasibility of the approach.
