On data collection, graph construction, and sampling in Twitter

We present a detailed study of data collection, graph construction, and sampling in Twitter. We observe that sampling semantic graphs (i.e., graphs with multiple edge types) poses challenges fundamentally distinct from those of sampling traditional graphs. Our aim is to identify these challenges and offer initial solutions for sampling semantic graphs. Novel elements of our work include the following: (1) We provide a thorough discussion of the problems encountered with naïve breadth-first search on semantic graphs, and argue that common sampling methods such as breadth-first search face specific challenges there that do not arise on graphs with homogeneous edge types. (2) We present two competing methods for creating semantic graphs from collected data, corresponding to the interactions between the sampling of different edge types. (3) We discuss new metrics specific to graphs with multiple edge types, and the effect of the sampling method on these metrics. (4) We discuss open issues and potential solutions pertaining to sampling semantic graphs.
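To make the setting concrete, the following is a minimal sketch of naïve breadth-first sampling on a graph with typed edges. It is an illustration under stated assumptions, not the method used in this work: the function name `bfs_sample`, the adjacency format, the node budget, and the edge-type labels `follows`, `mentions`, and `retweets` are all hypothetical. The sketch shows only that the set of nodes reached, and the mix of edge types in the sample, can change depending on which edge types the search is allowed to traverse.

```python
from collections import Counter, deque


def bfs_sample(adj, seed, max_nodes, edge_types=None):
    """Naive BFS sample of a typed-edge graph (illustrative sketch).

    adj        : dict mapping node -> list of (neighbor, edge_type) pairs
    seed       : node at which the search starts
    max_nodes  : budget on the number of sampled nodes
    edge_types : optional set of edge types to traverse; None means all
    """
    visited = {seed}
    order = [seed]          # nodes in the order they were discovered
    edges = []              # sampled edges as (source, target, edge_type)
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        for v, etype in adj.get(u, []):
            if edge_types is not None and etype not in edge_types:
                continue    # this edge type is not being followed
            if v not in visited:
                if len(visited) >= max_nodes:
                    continue  # budget exhausted: do not admit new nodes
                visited.add(v)
                order.append(v)
                queue.append(v)
            edges.append((u, v, etype))
    return order, edges


# Toy graph with hypothetical edge types (assumption, not data from the study).
toy = {
    "a": [("b", "follows"), ("c", "mentions")],
    "b": [("d", "follows")],
    "c": [("d", "retweets"), ("a", "mentions")],
    "d": [],
}

all_order, all_edges = bfs_sample(toy, "a", max_nodes=3)
fol_order, fol_edges = bfs_sample(toy, "a", max_nodes=3, edge_types={"follows"})

all_counts = Counter(t for _, _, t in all_edges)
fol_counts = Counter(t for _, _, t in fol_edges)
```

With the same seed and budget, traversing every edge type samples nodes `a`, `b`, `c`, whereas following only `follows` edges samples `a`, `b`, `d`; the edge-type counts of the two samples differ accordingly.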
