Web Data Extraction

The creation of web wrappers (i.e., programs that extract data from the web) is a subject of study in the field of web data extraction. Designing a domain-specific language for a web wrapper is a challenging task, because it introduces trade-offs between the expressiveness of the wrapper’s language and its safety. In addition, little attention has been paid to the execution of a wrapper in a restricted environment. In this thesis, we present a new wrapping language, Serrano, designed with three goals in mind: (1) the ability to run in a restricted environment, such as a browser extension; (2) extensibility, to balance the trade-offs between the expressiveness of the command set and safety; and (3) processing capabilities, to eliminate the need for additional programs to clean the extracted data. Serrano has been successfully deployed in a number of projects and has provided encouraging results.
