Paper information - Web data cleansing for information retrieval using key resource page selection

With the explosive growth in the number of pages on the WWW, covering more useful information with limited storage and computation resources is becoming increasingly important in web IR research. Using an analysis of non-content features of web pages, we propose a clustering-based method to select high-quality pages from the whole page set. Although the resulting page set contains only 44.3% of the whole collection, it is associated with more than 98% of the links and covers about 90% of the key information. Link properties and retrieval effects are also examined, and experimental results show that the key resource selection method is more suitable for data cleansing and that the resulting page set outperforms the whole collection, with a smaller size and better retrieval performance.

Yiqun Liu | Min Zhang | Shaoping Ma | Canhui Wang
