Turn the Page: Automated Traversal of Paginated Websites

Content-intensive websites, such as Google or Amazon, paginate their results to accommodate limited screen sizes. Human users and automated tools alike therefore have to traverse the pagination links whenever they crawl a site, extract data, or automate common tasks that require access to the entire result set. Previous approaches, as well as existing crawlers and automation tools, rely on simple heuristics (e.g., considering only the link text) and fall back to an exhaustive exploration of the site where those heuristics fail. In particular, focused crawlers and data extraction systems target only a fraction of the individual pages of a given site, so highly accurate identification of pagination links is essential to avoid exhaustively exploring irrelevant pages. We identify pagination links in a wide range of domains and sites with near-perfect accuracy (99%). We obtain these results with a novel framework for web block classification, BERyL, that combines rule-based reasoning for feature extraction with machine learning for feature selection and classification. Through this combination, BERyL is applicable in a wide range of settings and can be adjusted to maximise either precision, recall, or speed. We illustrate how BERyL minimises the effort for feature extraction and evaluate the impact of a broad range of features (content, structural, and visual).
