A user-oriented web crawler for selectively acquiring online content in e-health research

MOTIVATION: Life stories of diseased and healthy individuals are abundantly available on the Internet. Collecting and mining such online content can offer valuable insights into patients' physical and emotional states throughout the pre-diagnosis, diagnosis, treatment and post-treatment stages of a disease, compared with those of healthy subjects. However, such content is widely dispersed across the web. Using traditional query-based search engines to collect relevant materials manually is labor-intensive and often incomplete, because the human effort available for composing queries and parsing results is limited. The alternative, blindly crawling the whole web, has proven inefficient and unaffordable for e-health researchers.

RESULTS: We propose a user-oriented web crawler that adaptively acquires user-desired content on the Internet to meet the specific online data source acquisition needs of e-health researchers. Experimental results on two cancer-related case studies show that the new crawler can substantially accelerate the acquisition of highly relevant online content compared with the existing state-of-the-art adaptive web crawling technology. For the breast cancer case study using the full training set, the new method achieves a cumulative precision between 74.7 and 79.4% from 5 h of execution through the end of the 20-h crawling session, compared with a cumulative precision between 32.8 and 37.0% for the peer method over the same period. For the lung cancer case study using the full training set, the new method achieves a cumulative precision between 56.7 and 61.2% from 5 h of execution through the end of the 20-h crawling session, compared with a cumulative precision between 29.3 and 32.4% for the peer method. Using the reduced training set in the breast cancer case study, the cumulative precision of our method is between 44.6 and 54.9%, whereas that of the peer method is between 24.3 and 26.3%. For the lung cancer case study using the reduced training set, the cumulative precision of our method is between 35.7 and 46.7%, versus between 24.1 and 29.6% for the peer method. These numbers show that our method is consistently more accurate in discovering and acquiring user-desired online content for e-health research.

AVAILABILITY AND IMPLEMENTATION: The implementation of our user-oriented web crawler is freely available to non-commercial users at http://bsec.ornl.gov/AdaptiveCrawler.shtml. The Web site provides a step-by-step guide to running the crawler, together with the two study datasets, including the manually labeled ground truth, the initial seeds and the crawling results reported in this article.
