Webpage understanding: beyond page-level search

In this paper, we introduce the webpage understanding problem, which consists of three subtasks: webpage segmentation, webpage structure labeling, and webpage text segmentation and labeling. The problem is motivated by the search applications we have been working on, including Microsoft Academic Search, Windows Live Product Search, and Renlifang Entity Relationship Search. We believe that integrated webpage understanding will be an important direction for future research in Web mining.
