A Scalable Approach to Harvest Modern Weblogs

Blogs are one of the most prominent means of communication on the web. Their content, interconnections and influence constitute a unique socio-technical artefact of our times that needs to be preserved. The BlogForever project has established best practices and developed an innovative system to harvest, preserve, manage and reuse blog content. This paper presents the latest developments of the blog crawler, a key component of the BlogForever platform. More precisely, our work concentrates on techniques to automatically extract content such as articles, authors, dates and comments from blog posts. To achieve this goal, we introduce a simple yet robust and scalable algorithm that generates extraction rules by string matching, using the blog's web feed in conjunction with the blog's hypertext. Furthermore, we present a system architecture characterised by efficiency, modularity, scalability and interoperability with third-party systems. Finally, we evaluate the performance and accuracy of our system.
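
To make the idea of feed-driven rule generation concrete, the following is a minimal sketch, not the BlogForever implementation. It assumes that a rule is simply the tag/class path of the HTML text node that best matches a value taken from the web feed (for example a post title), and it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure. The names `learn_rule` and `apply_rule`, and the path notation, are hypothetical and chosen only for illustration.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Elements that never have a closing tag; they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class _TextNodes(HTMLParser):
    """Collect (path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []  # open elements as (tag, selector) pairs
        self.nodes = []  # (selector path, stripped text)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        selector = f"{tag}.{classes[0]}" if classes else tag
        self.stack.append((tag, selector))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element (tolerates unclosed tags).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            path = " > ".join(selector for _, selector in self.stack)
            self.nodes.append((path, text))


def _text_nodes(html):
    parser = _TextNodes()
    parser.feed(html)
    parser.close()
    return parser.nodes


def learn_rule(html, feed_value):
    """Return the path of the text node most similar to feed_value (assumed rule format)."""
    nodes = _text_nodes(html)
    if not nodes:
        return None
    best = max(nodes, key=lambda n: SequenceMatcher(None, n[1], feed_value).ratio())
    return best[0]


def apply_rule(html, rule):
    """Return the text of every node whose path equals the learned rule."""
    return [text for path, text in _text_nodes(html) if path == rule]


# Hypothetical pages: one post that also appears in the feed, one that does not.
known_post = ('<html><body><div class="sidebar"><h2>Archive</h2></div>'
              '<div class="post"><h1 class="title">Hello World</h1>'
              '<p>Body text here.</p></div></body></html>')
other_post = ('<html><body><div class="sidebar"><h2>Archive</h2></div>'
              '<div class="post"><h1 class="title">Second Post</h1>'
              '<p>More text.</p></div></body></html>')

rule = learn_rule(known_post, "Hello World")  # title value taken from the feed
print(rule)
print(apply_rule(other_post, rule))
```

Under these assumptions, a rule learned from a post that appears in the feed can be reused on posts from the same blog that the feed no longer lists, provided the blog's template is consistent across posts.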
