Web page segmentation with structured prediction and its application in web page classification

We propose a framework that performs Web page segmentation with a structured prediction approach. It formulates the segmentation task as a structured labeling problem on a transformed Web page segmentation graph (WPS-graph). The WPS-graph models the candidate segmentation boundaries of a page and the dependency relations among adjacent segmentation boundaries. Each labeling of the WPS-graph corresponds to a possible segmentation of the page. The task of finding the optimal labeling of the WPS-graph is transformed into a binary Integer Linear Programming problem, which treats the entire WPS-graph as a whole to conduct structured prediction. A learning algorithm based on the structured output Support Vector Machine framework is developed to determine the feature weights; it can account for the inter-dependency among candidate segmentation boundaries. Furthermore, we investigate the efficacy of the framework in supporting automatic Web page classification.
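
To make the labeling formulation concrete, the following is a minimal sketch rather than the method of the paper. It assumes each candidate boundary i carries a score s_i for being selected and each pair of adjacent boundaries (i, j) carries a dependency score p_ij that applies when both are selected. The names `best_labeling`, `node_scores`, and `edge_scores`, the toy numbers, and the exact form of the objective are all hypothetical; the abstract does not specify the objective or constraints. For a graph this small, exhaustive enumeration over binary labelings finds the same optimum that a binary ILP solver would; a real ILP would replace each product y_i * y_j with an auxiliary binary variable and linear constraints.

```python
from itertools import product


def best_labeling(node_scores, edge_scores):
    """Exhaustively maximize sum_i s_i*y_i + sum_(i,j) p_ij*y_i*y_j over y in {0,1}^n.

    node_scores: list of per-boundary scores (hypothetical).
    edge_scores: dict mapping (i, j) index pairs to dependency scores (hypothetical).
    Returns the best labeling as a tuple of 0/1 values and its objective value.
    """
    n = len(node_scores)
    best, best_val = None, float("-inf")
    for y in product((0, 1), repeat=n):
        # Unary part: reward or penalty for selecting each candidate boundary.
        val = sum(s * yi for s, yi in zip(node_scores, y))
        # Pairwise part: applies only when both adjacent boundaries are selected.
        val += sum(p * y[i] * y[j] for (i, j), p in edge_scores.items())
        if val > best_val:
            best, best_val = y, val
    return best, best_val


# Toy chain of four candidate boundaries (illustrative numbers only).
node_scores = [2.0, 0.5, -1.0, 1.5]
edge_scores = {(0, 1): -1.5, (1, 2): 0.8, (2, 3): -0.2}
labels, score = best_labeling(node_scores, edge_scores)
print(labels, score)
```

In this toy instance the second boundary has a positive score of its own but is left unselected because of its negative dependency on the first, which is the kind of interaction a per-boundary classifier would miss. Enumeration grows as 2^n, so it only serves to illustrate the objective on small graphs.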
