A Novel Web Page Categorization Algorithm Based on Block Propagation Using Query-Log Information

Most existing web page classification algorithms, whether content-based, link-based, or based on query-log analysis, treat the page as the smallest unit. However, web pages usually contain noisy or biased information that can degrade classification performance. In this paper, we propose a Block Propagation Categorization (BPC) algorithm that mines web page structure in depth and treats blocks as the basic semantic units. Using query-log information, BPC propagates only suitable information (blocks) among web pages to emphasize their topics. We also optimize the BPC algorithm to significantly speed up the block propagation process without losing any precision. Our experiments on ODP and an MSN search engine log show that BPC achieves a substantial improvement over traditional approaches.
