Extracting Content from Web Pages Using the Sliding Window

Content extraction is an important technology for accessing and processing web information. In this paper, we propose a content extraction algorithm based on a sliding window and guided by a statistical heuristic. Experiments show that the algorithm extracts most of the main content from web pages. Because the heuristic is simple and effective, the sliding-window algorithm is applicable to most kinds of web pages.
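
The abstract does not say which statistic the heuristic uses or how the window is moved, so the following is only an illustrative sketch of the general idea, not the paper's algorithm. It assumes a hypothetical heuristic, the fraction of non-tag characters per source line, averaged over a window centered on each line. The names `text_density` and `extract_main_content`, the window size, and the threshold are all assumptions made for this example.

```python
import re

# Assumption: a crude tag matcher is enough for illustration;
# it is not a full HTML parser.
TAG_RE = re.compile(r"<[^>]*>")


def text_density(line: str) -> float:
    """Fraction of a line's characters that remain after removing tags."""
    stripped = line.strip()
    if not stripped:
        return 0.0
    text = TAG_RE.sub("", stripped).strip()
    return len(text) / len(stripped)


def extract_main_content(html: str, window: int = 3, threshold: float = 0.5) -> str:
    """Keep lines whose surrounding window has a high average text density.

    Hypothetical parameters: `window` is the number of lines in the
    sliding window, `threshold` is the minimum average density.
    """
    lines = html.splitlines()
    densities = [text_density(line) for line in lines]
    half = window // 2
    kept = []
    for i, line in enumerate(lines):
        # The window is truncated at the start and end of the document.
        lo = max(0, i - half)
        hi = min(len(lines), i + half + 1)
        score = sum(densities[lo:hi]) / (hi - lo)
        if score >= threshold:
            text = TAG_RE.sub("", line).strip()
            if text:
                kept.append(text)
    return "\n".join(kept)
```

Averaging over a window rather than scoring each line in isolation means that a short, tag-heavy line inside a run of text can still be kept, while isolated link lines in navigation areas are dropped. Suitable values for the window size and threshold would have to be tuned on real pages.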
