Web data cleansing for information retrieval using key resource page selection
With the explosive growth of pages on the WWW, covering more useful information with limited storage and computation resources is becoming increasingly important in web IR research. Using analysis of non-content features of web pages, we propose a clustering-based method to select high-quality pages from the whole page set. Although the resulting page set contains only 44.3% of the whole collection, it is related to more than 98% of links and covers about 90% of key information. Link properties and retrieval effects are also examined, and experimental results show that the key resource selection method is more suitable for data cleansing: the resulting page set is smaller than the whole collection and gives better retrieval performance.
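As a rough illustration of the kind of coverage figures quoted above (share of pages kept versus share of links still related to the kept pages), the following is a minimal sketch. It is not the paper's method: the function name `coverage`, the toy graph, and the rule that a link counts as "related" when either endpoint is in the selected set are all assumptions made for this example.

```python
def coverage(pages, links, selected):
    """Return (page_fraction, link_fraction) for a selected page subset.

    Assumption: a link is counted as related to the subset when its
    source or its destination is a selected page.
    """
    selected = set(selected)
    page_fraction = len(selected) / len(pages)
    touched = sum(1 for src, dst in links if src in selected or dst in selected)
    link_fraction = touched / len(links) if links else 0.0
    return page_fraction, link_fraction


# Hypothetical toy collection: four pages linked in a ring.
pages = ["a", "b", "c", "d"]
links = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]

# Keeping half of the pages still touches every link in this toy graph.
page_frac, link_frac = coverage(pages, links, {"a", "c"})
print(page_frac, link_frac)
```

On a real collection the same two ratios could be computed from the crawl's link table, which is how a small selected set can still be related to most of the links.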