The Smallest Extraction Problem

We introduce landmark grammars, a new family of context-free grammars designed to describe the HTML source code of pages published by large, templated websites, and thereby to tackle Web data extraction problems effectively. They address the inherent ambiguity of HTML, one of the main challenges of Web data extraction, which has been largely neglected by the approaches presented in the literature despite over twenty years of research. We then formalize the Smallest Extraction Problem (SEP), an optimization problem that consists of finding the grammar of a family that best describes a set of pages and, at the same time, extracts their data. Finally, we present an unsupervised learning algorithm that induces a landmark grammar from a set of pages sharing a common HTML template, together with an automatic Web data extraction system. Experiments on established benchmarks show that the approach can contribute substantially to improving the state of the art.
