Reinforcing Web-object Categorization Through Interrelationships

Existing categorization algorithms deal with homogeneous Web objects and, when they take interrelationships with other types of objects into account, treat the interrelated objects merely as additional features. However, focusing on any single aspect of the inter-object relationship is not sufficient to reveal the true categories of Web objects. In this paper, we propose a novel categorization algorithm, the Iterative Reinforcement Categorization algorithm (IRC), which exploits the full interrelationship between different types of Web objects, including Web pages and queries. IRC classifies interrelated Web objects by iteratively reinforcing the individual classification results of each object type via their interrelationships. Experiments on a clickthrough-log dataset from the MSN search engine show that, in terms of the F1 measure, IRC achieves a 26.4% improvement over a pure content-based classification method, a 21% improvement over a query-metadata-based method, and a 16.4% improvement over the well-known virtual-document-based method. Our experiments also show that IRC converges fast enough to be applicable to real-world applications.
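
To make the idea of iterative reinforcement concrete, the following is a minimal sketch and not the paper's actual formulation, which the abstract does not specify. It assumes that each page and each query starts with a content-based class distribution, that a click matrix (queries by pages) encodes the interrelationship, and that each round blends an object's own content-based estimate with the click-weighted average of its neighbors' current estimates. The function names, the mixing weight `alpha`, and the stopping tolerance `tol` are all hypothetical choices made for illustration.

```python
def normalize(row):
    """Scale a non-negative vector to sum to 1 (all-zero rows stay zero)."""
    total = sum(row)
    return [v / total for v in row] if total > 0 else row[:]


def propagate(links, neighbor_probs, n_classes):
    """For each object, average its neighbors' class distributions by link weight."""
    out = []
    for weights in links:
        acc = [0.0] * n_classes
        for j, w in enumerate(weights):
            for c in range(n_classes):
                acc[c] += w * neighbor_probs[j][c]
        out.append(normalize(acc))
    return out


def mix(own, from_neighbors, alpha):
    """Blend content-based estimates with neighbor estimates (assumed linear rule)."""
    result = []
    for o, n in zip(own, from_neighbors):
        if sum(n) == 0:  # object has no links: keep its content-based estimate
            result.append(o[:])
        else:
            result.append([alpha * a + (1 - alpha) * b for a, b in zip(o, n)])
    return result


def iterative_reinforcement(page_content, query_content, clicks,
                            alpha=0.5, tol=1e-6, max_iter=100):
    """clicks[q][p] is the (hypothetical) click count of query q on page p."""
    n_classes = len(page_content[0])
    clicks_t = [list(col) for col in zip(*clicks)]  # pages x queries
    pages = [p[:] for p in page_content]
    queries = [q[:] for q in query_content]
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        # Queries are re-estimated from pages, then pages from the new queries.
        new_queries = mix(query_content, propagate(clicks, pages, n_classes), alpha)
        new_pages = mix(page_content, propagate(clicks_t, new_queries, n_classes), alpha)
        delta = max(
            abs(a - b)
            for old, new in ((pages, new_pages), (queries, new_queries))
            for row_old, row_new in zip(old, new)
            for a, b in zip(row_old, row_new)
        )
        pages, queries = new_pages, new_queries
        if delta < tol:  # estimates have stopped changing
            break
    return pages, queries, n_iter


# Toy data: two classes, two pages with confident content scores,
# two queries whose own text is uninformative.
page_content = [[0.9, 0.1], [0.2, 0.8]]
query_content = [[0.5, 0.5], [0.5, 0.5]]
clicks = [[3, 0], [0, 2]]  # query 0 clicks page 0, query 1 clicks page 1

pages, queries, n_iter = iterative_reinforcement(page_content, query_content, clicks)
print(queries, n_iter)
```

In this toy setting the two queries have uninformative content on their own, and each drifts toward the class of the page it clicks on, while the pages are in turn pulled slightly toward their queries. Keeping the content-based term in every round (rather than replacing it) is a design choice of this sketch that makes the update a contraction, so the loop stops after a small number of rounds.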
