BlogForever Crawler: Techniques and Algorithms to Harvest Modern Weblogs

Blogs are a dynamic communication medium that has become widely established on the web. The BlogForever project has developed an innovative system to harvest, preserve, manage and reuse blog content. This paper presents a key component of the BlogForever platform: the web crawler. More precisely, our work concentrates on techniques to automatically extract content such as articles, authors, dates and comments from blog posts. To achieve this goal, we introduce a simple and robust algorithm that generates extraction rules by string matching between the blog's web feed and the blog's hypertext. This approach leads to a scalable blog data extraction process. Furthermore, we show how we integrate a web browser into the web harvesting process in order to support data extraction from blogs with JavaScript-generated content.
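
The rule-generation idea can be illustrated with a minimal sketch. It is not the BlogForever implementation: it assumes that a value taken from the web feed (for example a post title) is compared against the text of each element in the post's HTML, and that the path of the best-matching element is kept as the extraction rule. The names `dice`, `TextByPath` and `infer_rule`, the path notation, and the choice of a Sørensen-Dice bigram similarity are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they are never pushed on the path stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


def bigrams(text):
    """Character bigrams of the lowercased, whitespace-normalised text."""
    s = " ".join(text.lower().split())
    return [s[i:i + 2] for i in range(len(s) - 1)]


def dice(a, b):
    """Sørensen-Dice similarity over character bigrams, in [0, 1]."""
    x, y = bigrams(a), bigrams(b)
    if not x or not y:
        return 0.0
    remaining = list(y)
    overlap = 0
    for gram in x:
        if gram in remaining:
            remaining.remove(gram)  # count each bigram of y at most once
            overlap += 1
    return 2.0 * overlap / (len(x) + len(y))


class TextByPath(HTMLParser):
    """Collects the direct text found under each element path."""

    def __init__(self):
        super().__init__()
        self.stack = []   # current chain of open elements
        self.texts = {}   # path string -> concatenated text

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        self.stack.append(tag + ("." + classes[0] if classes else ""))

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1].split(".")[0] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip() and self.stack:
            path = " > ".join(self.stack)
            self.texts[path] = self.texts.get(path, "") + data


def infer_rule(feed_value, page_html):
    """Return the path whose text is most similar to the feed value."""
    parser = TextByPath()
    parser.feed(page_html)
    if not parser.texts:
        return None
    return max(parser.texts, key=lambda p: dice(feed_value, parser.texts[p]))


if __name__ == "__main__":
    # Hypothetical post page and feed title, for illustration only.
    page = (
        '<html><body><div class="header"><h1>My Blog</h1></div>'
        '<div class="post"><h2 class="entry-title">Harvesting modern weblogs</h2>'
        '<p class="content">Some body text about crawling.</p></div>'
        '</body></html>'
    )
    print(infer_rule("Harvesting Modern Weblogs", page))
```

Under these assumptions, a rule inferred from one post could be reused on other posts of the same blog, since posts generated from the same template tend to share their markup. The sketch merges the text of sibling elements that share a path and ignores malformed HTML, both of which a real crawler would need to handle.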
