Learning Document Labels from Enriched Click Graphs

Document classification plays an increasingly important role in extracting and organizing knowledge. However, Web document classification is hindered by the huge number of Web documents and the limited human judgment available for labeling training data. To obtain sufficient training data cost-efficiently, we propose a semi-supervised learning approach that predicts a document’s class label by mining the click graph. To overcome the sparseness of the click graph, we enrich it with hyperlinks between Web documents. Content-based constraints are further added to regularize the graph. The resulting graph unifies three data sources: click-through data, hyperlinks, and content relevance. Starting from a very small seed set of manually labeled documents, we automatically discover a large number of relevant documents by applying a Markov random walk model to the enriched click graph. The top pages with high confidence scores are added to the current training data for classifier training. We investigate various combinations of the three sources and conduct extensive experiments on six typical Web classification tasks. The experimental results show that the click graph enriched with hyperlink and content information can significantly improve classification quality across multiple tasks with only minimal human labeling cost.
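To make the pipeline concrete, the following is a minimal sketch of the idea: merge click, hyperlink, and content-similarity edges into one weighted graph, run a random walk from labeled seed documents, and rank unlabeled documents by score. It is an illustration under stated assumptions, not the paper's exact model: the walk here is a random walk with restart, and the function names (`build_graph`, `random_walk_scores`, `top_candidates`), the source weights, the `restart` value, the `q:`/`d` node-naming convention, and the toy data are all hypothetical.

```python
from collections import defaultdict


def build_graph(clicks, hyperlinks, content_sims,
                w_click=1.0, w_link=1.0, w_content=1.0):
    """Merge three (u, v, weight) edge lists into one undirected weighted graph.

    The per-source weights are illustrative knobs, not values from the paper.
    """
    graph = defaultdict(lambda: defaultdict(float))
    sources = ((clicks, w_click), (hyperlinks, w_link), (content_sims, w_content))
    for edges, scale in sources:
        for u, v, weight in edges:
            graph[u][v] += scale * weight
            graph[v][u] += scale * weight
    return graph


def random_walk_scores(graph, seeds, restart=0.15, iterations=50):
    """Random walk with restart from the seed nodes (an assumed walk variant).

    Returns, for every node, the approximate probability that the walker
    is at that node; higher means more closely tied to the seeds.
    """
    nodes = list(graph)
    seeds = set(seeds) & set(nodes)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(start)
    for _ in range(iterations):
        # With probability `restart`, jump back to a seed node.
        nxt = {n: restart * start[n] for n in nodes}
        for u in nodes:
            total = sum(graph[u].values())
            # Otherwise, follow an outgoing edge in proportion to its weight.
            for v, w in graph[u].items():
                nxt[v] += (1.0 - restart) * scores[u] * w / total
        scores = nxt
    return scores


def top_candidates(scores, seeds, k):
    """Return the k highest-scoring unlabeled documents (ids starting with 'd')."""
    docs = [n for n in scores if n.startswith("d") and n not in seeds]
    return sorted(docs, key=lambda n: scores[n], reverse=True)[:k]


# Toy data: query nodes are prefixed "q:", documents are "d1".."d5".
clicks = [("q:cheap flights", "d1", 3), ("q:cheap flights", "d2", 1),
          ("q:python tutorial", "d4", 2)]
hyperlinks = [("d2", "d3", 1)]       # d3 has no clicks; reachable only by link
content_sims = [("d4", "d5", 0.5)]   # d5 is reachable only by content similarity

seeds = {"d1"}                       # one manually labeled document
graph = build_graph(clicks, hyperlinks, content_sims)
scores = random_walk_scores(graph, seeds)
candidates = top_candidates(scores, seeds, k=2)
print(candidates)
```

In this toy graph, d3 receives no clicks at all and is reached from the seed only through the hyperlink from d2, which illustrates how enrichment can counter click sparseness. The selected candidates would then be added to the training set for a conventional classifier.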
