Block-Based Language Modeling Approach Towards Web Search

In the probabilistic language modeling approach to information retrieval, a model is estimated individually for each document. However, as Web pages become more complex, a single page may contain several blocks that discuss different topics. Consequently, the performance of a statistical model for a Web document tends to be degraded by this mixture of topics. In this paper, we argue that segmenting a Web page into several relatively independent blocks assists language modeling, and we propose a Block-based Language Modeling (BLM) approach. Unlike the standard method, BLM splits the modeling process into two parts: the probability of a query occurring in a block, and the probability of a block occurring in a Web page. Given a query, pages with more relevant blocks then tend to be retrieved. Experimental results show that, when a unigram model is used, our approach outperforms the original language modeling approach for Web search in most cases.
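
The two-part decomposition can be illustrated with a minimal Python sketch. It reads the abstract as scoring a page by summing, over its blocks, the probability of the query given the block times the probability of the block given the page. Everything beyond that reading is an assumption made for illustration and is not taken from the paper: the function names, the Jelinek-Mercer smoothing against a collection model, the smoothing weight `lam`, and the choice to set the block probability proportional to block length.

```python
from collections import Counter

def unigram_prob(term, block_counts, block_len, coll_counts, coll_len, lam=0.5):
    """P(term | block), smoothed with the collection model.

    Jelinek-Mercer smoothing and lam=0.5 are illustrative assumptions.
    """
    p_block = block_counts[term] / block_len if block_len else 0.0
    p_coll = coll_counts[term] / coll_len if coll_len else 0.0
    return lam * p_block + (1.0 - lam) * p_coll

def blm_score(query, blocks, coll_counts, coll_len, lam=0.5):
    """Sum over blocks of P(query | block) * P(block | page)."""
    page_len = sum(len(block) for block in blocks)
    score = 0.0
    for block in blocks:
        if not block:
            continue
        counts = Counter(block)
        # Unigram model: query terms are treated as independent.
        p_query_given_block = 1.0
        for term in query:
            p_query_given_block *= unigram_prob(
                term, counts, len(block), coll_counts, coll_len, lam
            )
        # Assumption: a block's probability is its share of the page's tokens.
        p_block_given_page = len(block) / page_len
        score += p_query_given_block * p_block_given_page
    return score

# Hypothetical toy pages: both contain the same words, grouped differently.
page_a = [["jaguar", "car", "speed"], ["weather", "rain", "today"]]
page_b = [["jaguar", "weather", "speed"], ["car", "rain", "today"]]

collection = [tok for page in (page_a, page_b) for block in page for tok in block]
coll_counts = Counter(collection)
coll_len = len(collection)

query = ["jaguar", "car"]
score_a = blm_score(query, page_a, coll_counts, coll_len)
score_b = blm_score(query, page_b, coll_counts, coll_len)
print(score_a, score_b)
```

In this toy setup a whole-page unigram model cannot separate the two pages, because they contain exactly the same words. The block-based score ranks `page_a` higher, since both query terms fall in the same block there.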
