RetriBlog: a framework for creating blog crawlers

Blogs are becoming an important social tool. Through blogs, bloggers share their likes and dislikes, express their opinions, report news, and form groups around particular subjects. The information available on the blogosphere can therefore help in creating interesting applications in various domains, such as e-learning, e-commerce, and e-government. However, due to the increasing number of blogs posted every day on the Web and the dynamic nature of the blogosphere, collecting and extracting relevant information from blogs has become hard and time-consuming. In this paper, we use techniques from both the information retrieval and information extraction fields to deal with this problem. Since blogs have many points of variability, applications that process them must be easy to adapt. We present RetriBlog, a framework for developing blog crawlers that deals with the variations in blogs. This paper presents the details of RetriBlog and an evaluation of the proposed algorithms.
