Learning probabilistic models of the Web (poster session)

In the World Wide Web, myriads of hyperlinks connect documents and pages to create an unprecedented, highly complex graph structure - the Web graph. This paper presents a novel approach to learning probabilistic models of the Web, which can be used to make reliable predictions about the connectivity and information content of Web documents. The proposed method is a probabilistic dimension reduction technique that recasts and unites Latent Semantic Analysis and Kleinberg's Hubs-and-Authorities algorithm in a statistical setting. This is meant to be a first step towards the development of a statistical foundation for Web-related information technologies. Although this paper does not focus on a particular application, a variety of algorithms operating in the Web/Internet environment can take advantage of the presented techniques, including search engines, Web crawlers, and information agent systems.
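
The abstract does not spell out the model, but the description suggests a PLSA-style latent-variable factorization of the Web link matrix fit with EM, in the spirit of a probabilistic counterpart to Hubs-and-Authorities. The sketch below is only an illustration of that idea under those assumptions; the factorization P(d, c) = Σ_z P(z) P(d|z) P(c|z), the function and variable names, and the toy link matrix are all illustrative and not the paper's own formulation.

```python
# Illustrative sketch: EM for a PLSA-style decomposition of a link count matrix.
# N[d, c] = number of links from document d to document c (an assumed setup).
import numpy as np

def fit_link_model(N, n_factors=4, n_iter=100, seed=0):
    """Fit P(d, c) = sum_z P(z) P(d|z) P(c|z) to a (docs x cited-docs) matrix N."""
    rng = np.random.default_rng(seed)
    n_docs, n_cited = N.shape

    # Random non-negative initialization of the multinomial parameters.
    p_z = rng.random(n_factors)                    # P(z)
    p_d_z = rng.random((n_factors, n_docs))        # P(d | z)
    p_c_z = rng.random((n_factors, n_cited))       # P(c | z)
    p_z /= p_z.sum()
    p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    p_c_z /= p_c_z.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: posterior P(z | d, c), shape (n_factors, n_docs, n_cited).
        joint = p_z[:, None, None] * p_d_z[:, :, None] * p_c_z[:, None, :]
        post = joint / np.maximum(joint.sum(axis=0, keepdims=True), 1e-12)

        # M-step: re-estimate parameters from expected link counts.
        expected = post * N[None, :, :]
        p_d_z = expected.sum(axis=2)
        p_c_z = expected.sum(axis=1)
        p_z = p_d_z.sum(axis=1)
        p_d_z /= np.maximum(p_d_z.sum(axis=1, keepdims=True), 1e-12)
        p_c_z /= np.maximum(p_c_z.sum(axis=1, keepdims=True), 1e-12)
        p_z /= p_z.sum()

    return p_z, p_d_z, p_c_z

# Toy usage on a small link matrix; within each latent factor z, P(c | z)
# plays a role loosely analogous to HITS authority weights.
N = np.array([[3, 1, 0, 0],
              [2, 2, 0, 1],
              [0, 0, 4, 2],
              [0, 1, 3, 3]], dtype=float)
p_z, p_d_z, p_c_z = fit_link_model(N, n_factors=2)
print("P(z):", np.round(p_z, 3))
print("P(c|z):\n", np.round(p_c_z, 3))
```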