Focused crawler for the acquisition of health articles

Technology-based health interventions can serve as an alternative to consulting a doctor, especially for common health problems. Such technology needs a health knowledge base as its foundation, and current advances in artificial intelligence and hardware make building one feasible. The broader goal of our research is an application that uses a health knowledge base to provide health interventions. As a first step, we collect health-related articles. To do so, we build a focused crawler that combines multithreaded programming, the Larger-Sites-First algorithm, and a Naïve Bayes classifier. We find that article acquisition saturates as the number of threads increases. Furthermore, the Larger-Sites-First algorithm does increase the number of crawled articles, but the increase is not significant. In addition, under ideal conditions Naïve Bayes correctly recognizes at least 90 percent of articles in both the health and non-health categories. However, its performance drops on non-health articles that contain health keywords.
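The two core components named above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that Larger-Sites-First means serving the host with the most pending URLs first, and that the classifier is a multinomial Naïve Bayes over whitespace-separated tokens with add-one smoothing. All class names, method names, and example URLs are hypothetical; fetching, multithreading, stemming, and boilerplate removal are left out.

```python
import math
from collections import Counter, defaultdict, deque
from urllib.parse import urlparse


class LargerSitesFirstFrontier:
    """Crawl frontier that serves the host with the most pending URLs.

    Assumption: 'site size' is the number of discovered but uncrawled URLs.
    """

    def __init__(self):
        self.pending = defaultdict(deque)  # host -> queue of uncrawled URLs
        self.seen = set()                  # every URL ever added

    def add(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.pending[urlparse(url).netloc].append(url)

    def pop(self):
        """Return the next URL to crawl, or None when the frontier is empty."""
        if not self.pending:
            return None
        # Ties go to the host that entered the frontier first.
        host = max(self.pending, key=lambda h: len(self.pending[h]))
        url = self.pending[host].popleft()
        if not self.pending[host]:
            del self.pending[host]
        return url


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, label, text):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def predict(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

In a crawler built along these lines, each worker thread would take a URL from the frontier, fetch and clean the page, keep it only if the classifier labels it as health, and add its outgoing links back to the frontier. Because the classifier scores individual tokens, a non-health article that repeats health keywords can still score highly for the health category, which is consistent with the weakness reported above.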
