Web key resource page selection based on non-content information
Information growth makes it impossible for search engines to crawl and index every page on the Web. Meanwhile, the set of indexed pages is filled with low-quality information and spam. Selecting high-quality Web pages (key resource pages) in a query-independent way is therefore a considerable challenge. Based on an analysis of the non-content features of key resources, a pre-selection method was introduced for topic distillation research. A decision tree was constructed to locate key resource pages using query-independent non-content features, including in-degree, document length, URL type, and two newly proposed features derived from an analysis of each site's self-link structure. Although the resulting page set contained only about 20% of the pages in the whole collection, it covered more than 70% of the key resources. Furthermore, information retrieval on this page set improved by more than 60% relative to retrieval on all pages. This shows an effective way to achieve better topic distillation performance with a smaller data set.
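
The pre-selection idea can be illustrated with a minimal Python sketch. It is not the authors' decision tree: the abstract gives neither the tree structure nor any thresholds, so the rules in `is_candidate`, the cut-off values, and the ROOT/SUBROOT/PATH/FILE URL taxonomy are all assumptions chosen for illustration. The two self-link features are left out because the abstract does not define them. The `evaluate` helper computes the two quantities the abstract reports: the fraction of the collection that is selected and the fraction of key resources that the selection covers.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Page:
    url: str
    in_degree: int         # number of inbound links
    doc_length: int        # document length in words
    is_key: bool = False   # ground-truth label, used only for evaluation


def url_type(url: str) -> str:
    """Classify a URL as ROOT, SUBROOT, PATH, or FILE (assumed taxonomy)."""
    path = urlparse(url).path
    if path in ("", "/") or path.lower() in ("/index.html", "/index.htm"):
        return "ROOT"
    if path.endswith("/"):
        depth = len([segment for segment in path.split("/") if segment])
        return "SUBROOT" if depth == 1 else "PATH"
    return "FILE"


def is_candidate(page: Page, min_in_degree: int = 10, min_length: int = 100) -> bool:
    """Hand-written decision rules; both thresholds are hypothetical."""
    kind = url_type(page.url)
    if kind == "ROOT":
        return True  # site entry pages are always kept in this sketch
    if kind in ("SUBROOT", "PATH"):
        return page.in_degree >= min_in_degree
    # FILE pages must be both well linked and reasonably long
    return page.in_degree >= min_in_degree and page.doc_length >= min_length


def evaluate(pages):
    """Return (fraction of pages selected, fraction of key resources covered)."""
    selected = [p for p in pages if is_candidate(p)]
    total_key = sum(p.is_key for p in pages)
    covered = sum(p.is_key for p in selected)
    coverage = covered / total_key if total_key else 0.0
    return len(selected) / len(pages), coverage
```

In practice the rules would be learned from labeled key resource pages rather than written by hand. The hand-written form is used here only to make the role of each non-content feature visible.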