Web Content Extraction: a Meta-Analysis of its Past and Thoughts on its Future

In this paper, we present a meta-analysis of several Web content extraction algorithms and make recommendations for the future of content extraction on the Web. First, we find that nearly all Web content extractors fail to consider a very large, and growing, portion of modern Web pages. Second, it is well understood that wrapper induction extractors tend to break as the Web changes; heuristic/feature-engineering extractors were thought to be immune to a Web site's evolution, but we find that this is not the case: the performance of heuristic content extractors also tends to degrade over time as Web site forms and practices evolve. We conclude with recommendations for future work that address these and other findings.
