Big-data analysis draws on a wide variety of data types, such as news, blogs, SNS, papers, patents, and sensor data. In particular, the use of web documents, which offer reliable data in real time, is gradually increasing. Web crawlers that collect web documents automatically have grown in importance because big data is being used in many different fields and web data are growing exponentially every year. However, existing web crawlers cannot collect all the web documents in a web site, because they follow only the URLs contained in the web documents they have already collected. In addition, existing web crawlers may re-collect web documents that other web crawlers have already collected, because information about the documents collected by each crawler is not managed efficiently across crawlers. This paper therefore proposes a distributed web crawler. To resolve these problems of existing web crawlers, the proposed web crawler collects web documents through the RSS feeds of each web site and the Google search API. It provides fast crawling performance through a client-server model based on RMI and NIO, which minimize network traffic. Furthermore, it extracts the core content of a web document by comparing keyword similarity over the tags contained in the document. Finally, to verify the superiority of the proposed web crawler, we compare it with existing web crawlers in various experiments.

■ Keywords: Web Crawler | Content Extraction | Big-data | RMI | NIO

Received: November 25, 2013. Revised: November 28, 2013. Accepted: November 28, 2013.
Corresponding author: 정한민, e-mail: jhm@kisti.re.kr
Journal of the Korea Contents Association (한국콘텐츠학회논문지) '13 Vol. 13 No. 12, p. 576
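The abstract does not give the details of the keyword similarity comparison, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that the page title supplies the reference keywords, that `p`, `div`, `article`, `section`, and `td` are the candidate tags, and that similarity is the share of title keywords found in a block. The names `BlockCollector`, `keywords`, `title_overlap`, and `extract_core_content` are hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumption: these tags are treated as candidate content blocks.
BLOCK_TAGS = {"p", "div", "article", "section", "td"}


class BlockCollector(HTMLParser):
    """Collects the page title and the text inside each candidate tag."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.blocks = []        # text of every finished candidate block
        self._stack = []        # text parts of the currently open blocks
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in BLOCK_TAGS:
            self._stack.append([])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in BLOCK_TAGS and self._stack:
            text = " ".join(self._stack.pop()).strip()
            if text:
                self.blocks.append(text)

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        if self._in_title:
            self.title += data
        elif self._stack:
            # Text is credited to the innermost open candidate block.
            self._stack[-1].append(data)


def keywords(text):
    """Lowercased alphanumeric tokens of a text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def title_overlap(title_kw, block_kw):
    """Share of title keywords that also occur in the block (assumed measure)."""
    return len(title_kw & block_kw) / len(title_kw) if title_kw else 0.0


def extract_core_content(html):
    """Return the block most similar to the title; longer text wins ties."""
    parser = BlockCollector()
    parser.feed(html)
    title_kw = keywords(parser.title)
    return max(
        parser.blocks,
        key=lambda b: (title_overlap(title_kw, keywords(b)), len(b)),
        default="",
    )
```

Under these assumptions, a navigation bar or copyright notice shares few keywords with the title and scores low, while the article body scores high. A production crawler would likely need a richer similarity measure and handling of malformed HTML.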