Integrating Semi-structured Data into Business Applications: A Web Intelligence Example

The World Wide Web is a vast source of public-domain information about market developments and competitor activities. This information, which can be retrieved from Web sites or online shops, for example, is increasingly a critical success factor for enterprises. Extracting it from such semi-structured sources is still mostly done manually and is very time-consuming. Powerful and user-friendly tools are therefore needed for extracting and integrating information from multiple Web sources or, more generally, from heterogeneous semi-structured data sources. In this paper we describe a solution for automatically retrieving data from public information sources, in particular the World Wide Web, and normalizing it into structured data formats. We also illustrate how this data can then be integrated automatically into Web Intelligence applications, which are often complex.
