RetriBlog: An architecture-centered framework for developing blog crawlers

Blogs have become an important social tool. They allow users to share their tastes, express their opinions, report news, and form groups around a subject, among other uses. The information obtained from the blogosphere can be used to create applications in various fields. However, due to the growing number of blogs posted every day, as well as the dynamic nature of the blogosphere, extracting relevant information from blogs has become difficult and time-consuming. In this paper, we use information retrieval and extraction techniques to deal with this problem. Furthermore, because blogs have many variation points, applications built for them must be easy to adapt. To address this scenario, this work proposes RetriBlog, an architecture-centered framework for the development of blog crawlers. Finally, the paper presents an evaluation of the proposed algorithms and three case studies.
