Parallelization in Extracting Fresh Information from Online Social Network

Online Social Networks (OSNs) have been among the most popular services in recent years. They record the lives of their users and offer great potential for journalists, sociologists and business analysts. Crawling data from social networks is a basic step for social network information analysis and processing. As networks grow huge and their information updates faster than web pages, crawling becomes more difficult because of the limitations of bandwidth, politeness etiquette and computation power. To extract fresh information from social networks efficiently and effectively, this paper presents a novel crawling method and discusses a parallelization architecture for crawling social networks. To discover the features of social networks, we gather data from a real social network, analyze it and build a model that describes the regularity of users' behavior. With the modeled behavior, we propose methods to predict users' behavior. According to the predictions, we schedule our crawler more reasonably and extract more fresh information with parallelization technologies. Experimental results demonstrate that our strategies can obtain information from OSNs efficiently and effectively.
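The idea of scheduling a crawler from predicted user behavior can be illustrated with a minimal sketch. This is not the paper's method: it assumes a simple constant-rate (Poisson-style) posting model, and all names here (`UserState`, `estimated_rate`, `expected_new_posts`, `schedule`) are hypothetical. Given a fixed request budget per round, it visits the users expected to have produced the most new posts since their last crawl.

```python
import heapq
from dataclasses import dataclass
from typing import List


@dataclass
class UserState:
    """Hypothetical per-user record kept by the crawler."""
    user_id: str
    post_times: List[float]  # observed post timestamps, in hours
    last_crawl: float        # time of the last crawl, in hours


def estimated_rate(user: UserState, window_start: float, now: float) -> float:
    """Posts per hour observed in [window_start, now] (assumed constant rate)."""
    span = now - window_start
    if span <= 0:
        return 0.0
    count = sum(1 for t in user.post_times if window_start <= t <= now)
    return count / span


def expected_new_posts(user: UserState, window_start: float, now: float) -> float:
    """Predicted number of posts made since the user's last crawl."""
    elapsed = max(0.0, now - user.last_crawl)
    return estimated_rate(user, window_start, now) * elapsed


def schedule(users: List[UserState], budget: int,
             window_start: float, now: float) -> List[str]:
    """Pick up to `budget` users with the highest predicted number of new posts."""
    ranked = heapq.nlargest(
        budget, users,
        key=lambda u: expected_new_posts(u, window_start, now),
    )
    return [u.user_id for u in ranked]


users = [
    UserState("a", [1, 2, 3, 4, 5, 6, 7, 8], last_crawl=9.0),  # active, crawled recently
    UserState("b", [2, 6], last_crawl=0.0),                    # quieter, long uncrawled
    UserState("c", [], last_crawl=0.0),                        # no observed activity
]
plan = schedule(users, budget=2, window_start=0.0, now=10.0)
```

In this toy input the quieter user `b` is ranked ahead of the more active user `a`, because `a` was crawled only an hour ago. In a parallel setting, the selected users could then be divided among several crawler workers; how the paper does this is not stated in the abstract.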
