Web Page Classification Based on Web Page Size and Hyperlinks and Web Site Hyperlink Structure

This paper presents a new metric, Page Rank × Inverse Links-to-Word count Ratio (PR × ILW), for classifying web pages as content or navigation pages. The metric combines a page's size and its number of hyperlinks with the page rank metric, which is based on website topology. We present a theoretical basis for the metric and the results of a web page classification study, which show that the metric, when combined with the links-to-word count ratio of web pages, accurately classifies pages into the two categories.
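
The abstract does not spell out the formula, so the following Python sketch is only one plausible reading of the metric's name: page rank multiplied by the inverse of the links-to-word count ratio (that is, words per hyperlink). The helper names `pagerank` and `pr_ilw`, the damping factor of 0.85, the handling of link-free pages, and the three-page example site are all assumptions made for illustration, not details taken from the paper.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a dict {page: [outlinked pages]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            # Only count links that stay inside the site graph.
            targets = [t for t in outlinks if t in links]
            if not targets:
                # Dangling page: spread its rank evenly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
        rank = new_rank
    return rank


def pr_ilw(rank, word_count, link_count):
    """Assumed form of the metric: PageRank times words-per-link."""
    # Assumption: a page with no hyperlinks is treated as having one,
    # to avoid division by zero.
    return rank * word_count / max(link_count, 1)


# Hypothetical three-page site: a link-heavy home page and two text-heavy pages.
site = {"home": ["a", "b"], "a": ["home"], "b": ["home"]}
words = {"home": 50, "a": 800, "b": 600}

ranks = pagerank(site)
scores = {p: pr_ilw(ranks[p], words[p], len(site[p])) for p in site}
```

Under this assumed form, the short, link-heavy `home` page receives a much lower score than the long pages `a` and `b`, even though `home` has the highest page rank. How the score is thresholded and combined with the links-to-word count ratio for the final classification is described in the paper itself and is not reproduced here.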
