Navigation objects extraction for better content structure understanding

Existing work on extracting navigation objects from webpages focuses on navigation menus, with the aim of revealing the information architecture of a site. However, web 2.0 sites such as social networks and e-commerce portals make the content structure of a website increasingly difficult to understand. Dynamic and personalized elements in a webpage, such as top stories and recommended lists, are vital to understanding the dynamic nature of these sites. To better understand the content structure of web 2.0 sites, in this paper we propose a new extraction method for navigation objects in a webpage. Our method extracts not only the static navigation menus but also the dynamic and personalized page-specific navigation lists. Since the navigation objects in a webpage naturally come in blocks, we first cluster hyperlinks into blocks by exploiting the spatial locations of the hyperlinks, the hierarchical structure of the DOM tree, and the hyperlink density. We then identify navigation objects among those blocks using an SVM classifier with novel features such as anchor text length. Experiments on real-world data sets containing webpages from various domains and styles verify the effectiveness of our method.
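
The two-stage idea (group hyperlinks into blocks, then describe each block with features) can be illustrated with a minimal sketch. It is not the method of this paper: as a simplifying assumption it groups links only by the tag path of their DOM container, ignores spatial location and hyperlink density, and stops before the SVM step. The names `LinkBlockParser` and `block_features` are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not enter the open-tag stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class LinkBlockParser(HTMLParser):
    """Group hyperlinks by the tag path of their DOM container (assumed heuristic)."""

    def __init__(self):
        super().__init__()
        self.stack = []                   # open tags from the root to the current node
        self.blocks = defaultdict(list)   # container tag path -> list of anchor texts
        self._anchor_text = None          # text buffer while inside an <a href=...>
        self._anchor_path = None          # container path of the anchor being read

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag == "a" and dict(attrs).get("href") is not None:
            self._anchor_text = []
            self._anchor_path = "/".join(self.stack)
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_text is not None:
            # Collapse runs of whitespace inside the anchor text.
            text = " ".join("".join(self._anchor_text).split())
            self.blocks[self._anchor_path].append(text)
            self._anchor_text = None
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if self._anchor_text is not None:
            self._anchor_text.append(data)


def block_features(anchor_texts):
    """Return simple per-block features; anchor text length is named in the abstract."""
    lengths = [len(t) for t in anchor_texts]
    return {
        "num_links": len(lengths),
        "mean_anchor_len": sum(lengths) / len(lengths),
    }


if __name__ == "__main__":
    page = """
    <html><body>
      <ul>
        <li><a href="/home">Home</a></li>
        <li><a href="/news">News</a></li>
        <li><a href="/about">About</a></li>
      </ul>
      <p>Read the <a href="/story">full story about the election results</a> here.</p>
    </body></html>
    """
    parser = LinkBlockParser()
    parser.feed(page)
    for path, texts in parser.blocks.items():
        print(path, block_features(texts))
```

In this toy page the three menu links share one container path and have short anchor texts, while the in-text link forms its own block with a much longer anchor text. Feature vectors of this kind could then be passed to a classifier such as an SVM; grouping by tag path alone would merge distinct lists that happen to share the same path, which is one reason to add signals such as spatial location.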
