When users need to analyze webpages on a specific topic, they typically use crawlers to acquire pages and then filter the results for those that match their interests. In the data acquisition stage, however, users usually have customized requirements for the data they collect. Ordinary crawler systems are resource-constrained and cannot traverse the entire internet, while search engines can meet these requirements but rely on extensive manual interaction. The traditional solution is to restrict crawlers to a limited set of domains, which leads to low recall and poor efficiency. To address these problems, this paper studies a focused crawler framework based on open search engines. The framework takes advantage of the information gathering and retrieval capabilities of open search engines and automatically or semi-automatically generates a topic model to interpret and complete the user's search intent, requiring only a few seed keywords as initial input. It then uses open search engine interfaces to iteratively crawl topic-specific webpages. Compared with traditional approaches, the proposed focused crawler improves recall and efficiency while maintaining accuracy.
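The iterative loop described above (seed keywords, a topic model, repeated search engine queries) can be illustrated with a minimal Python sketch. It is not the paper's implementation. The `search` callable is a hypothetical stand-in for an open search engine interface, the "topic model" is a plain term-frequency counter rather than whatever model the paper builds, and `focused_crawl`, `relevance`, `query_size`, and `threshold` are names and parameters assumed here for illustration only.

```python
import re
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical interface: takes a query string, yields (url, page_text) pairs.
SearchFn = Callable[[str], Iterable[Tuple[str, str]]]


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def relevance(tokens: List[str], topic: Counter) -> float:
    """Fraction of a page's tokens that already appear in the topic model."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in topic) / len(tokens)


def focused_crawl(
    seeds: List[str],
    search: SearchFn,
    rounds: int = 3,
    query_size: int = 3,
    threshold: float = 0.2,
) -> Dict[str, float]:
    """Return {url: relevance score} for pages judged on-topic."""
    # Assumption: the topic model starts as the seed keywords with weight 1.
    topic = Counter({s.lower(): 1 for s in seeds})
    collected: Dict[str, float] = {}
    issued = set()

    for _ in range(rounds):
        # Build the next query from the currently heaviest topic terms.
        query = " ".join(t for t, _ in topic.most_common(query_size))
        if query in issued:  # the query no longer changes, so stop early
            break
        issued.add(query)

        for url, text in search(query):
            if url in collected:
                continue
            tokens = tokenize(text)
            score = relevance(tokens, topic)
            if score >= threshold:
                collected[url] = score
                # Crude expansion: drop short tokens as a stand-in for a
                # stopword filter, then add the rest to the topic model.
                topic.update(t for t in tokens if len(t) > 3)
    return collected
```

Passing the search function in as a parameter keeps the sketch independent of any particular search engine API. A real system would also need rate limiting, page fetching, and a stronger relevance model than token overlap.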