Improving Data Collection on Article Clustering by Using Distributed Focused Crawler

Data is often collected, or harvested, from the Internet by using a web crawler. A general web crawler can be adapted to concentrate on a certain topic; this type of web crawler is called a focused crawler. A focused crawler alone is not enough to improve data collection performance, because it only makes more efficient use of network bandwidth and storage capacity. This research proposes a distributed focused crawler that improves crawling performance while remaining efficient in network bandwidth and storage capacity. The distributed focused crawler implements crawl scheduling, site ordering to determine the URL queue, and focused crawling based on a Naive Bayes classifier. The research also evaluates crawling performance under multithreading while observing CPU and memory utilization. The results show that crawling performance decreases when too many threads are used: CPU and memory utilization become very high, while the performance of the distributed focused crawler drops.
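
The abstract does not give implementation details, so the following is only a minimal sketch of how a Naive Bayes relevance check in a focused crawler might look. The class name `NaiveBayesRelevance`, the labels `relevant` and `irrelevant`, the tokenizer, and the training sentences are all hypothetical choices made for illustration; they are not taken from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesRelevance:
    """Hypothetical two-class multinomial Naive Bayes with Laplace smoothing."""

    LABELS = ("relevant", "irrelevant")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()

    def train(self, text, label):
        """Add one labelled page text to the training counts."""
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def log_score(self, text, label):
        """Return log P(label) + sum of log P(word | label) over the tokens.

        Assumes every label has at least one training example.
        """
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_words = sum(self.word_counts[label].values())
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in tokenize(text):
            # Laplace (add-one) smoothing keeps unseen words from zeroing the product.
            count = self.word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        return score

    def is_relevant(self, text):
        """A page is kept only if the 'relevant' class scores higher."""
        return self.log_score(text, "relevant") > self.log_score(text, "irrelevant")


if __name__ == "__main__":
    classifier = NaiveBayesRelevance()
    # Made-up training texts, standing in for labelled seed pages.
    classifier.train("health diet exercise doctor", "relevant")
    classifier.train("hospital patient treatment doctor", "relevant")
    classifier.train("football match goal score", "irrelevant")
    classifier.train("stock market price trade", "irrelevant")
    print(classifier.is_relevant("doctor recommends treatment and exercise"))
```

In a crawler of this kind, a check such as `is_relevant` would typically be applied to fetched page text before its outgoing links are added to the URL queue, so that bandwidth and storage are spent mainly on on-topic pages.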
