Learning-based Web Data Cleansing for Information Retrieval *

With the rapid growth of web information, selecting high-quality web pages that cover valuable information in a query-independent way is becoming increasingly important in web IR research. Based on query-independent feature analysis, we propose a data cleansing algorithm that selects an important type of high-quality page (key resources) on the web. A study of the cleansed page set shows that it contains only 44.3% of the pages in the whole collection, yet it retains more than 98% of the hyperlinks and covers about 90% of the key information. Experiments on TREC 2003 data show that the cleansed collection, at less than half the size of the whole collection, improves retrieval performance by 8%.