Improved classification via connectivity information

The motivation for our work is the observation that Web pages on a particular topic are often linked to other pages on the same topic. We model and analyze the problem of how to improve the classification of Web pages (that is, determining the topic of the page) by using link information. In our setting, an initial classifier examines the text of a Web page and assigns to it some classification, possibly mistaken. We investigate how to reduce the error probability using the observation above, thus building an improved classifier. We present a theoretical framework for this problem based on a random graph model and suggest two linear-time algorithms, based on similar methods that have been proven effective in the setting of error-correcting codes. We provide simulation results to verify our analysis and to compare the performance of our suggested
