Web Data Extraction for Service Creation

Web data extraction is an enabling technique in the search computing scenario. In this chapter, we first review the state of the art in wrapper technologies, focusing on how wrapper generators can be used to create unified services that integrate data from Web applications and Web services in various domains. Next, we describe the Lixto approach and present the Lixto Suite as one example of Web Process Integration. Finally, we discuss application areas, future challenges, and the use of wrapper technologies in the search computing context.
