The Light-Weight Semantic Web: Integrating Information Extraction and Information Retrieval for Heterogeneous Environments

Today’s Web, large intranets, and even the documents collected by a single user are enormous sources of distributed, heterogeneous information that cannot be easily mastered. Syntactic and semantic differences, as well as missing semantic annotations, make effective query evaluation on such corpora a hard task. The Semantic Web aims at providing a standard for semantic annotations, but has not yet made much progress in practice. This paper presents a light-weight version of the Semantic Web. We advocate the use of Information Extraction tools to automatically detect and annotate important classes of information that are frequently used in queries, such as locations and dates. We propose a query language that can exploit these extra annotations and allows novel range and join conditions.
