Analyse et application de la diffusion d'information dans les microblogs. (The analysis and applications of information diffusion in microblogs)

Microblog services (such as Twitter and Sina Weibo) have become an important platform for Internet content sharing. Because information from microblogs is widely used in public opinion mining, viral marketing, and political campaigns, it is important to understand how information diffuses over microblogs and to explain the process through which some tweets become popular.

The analysis of information diffusion in microblogs involves collecting data from microblogs, modeling how information spreads, and applying the resulting models. Dealing with the huge amount of data flowing through microblogs is itself a challenge, so designing an efficient and unbiased sampling algorithm for microblogs is essential. In addition, the retweeting process is complex because of the ephemerality of information, the topology of the microblog network, and the particular features (such as the number of followers) of publishers and retweeters.

Two traditional models have been used for information diffusion: the Independent Cascades and Linear Threshold models. However, neither of them describes the retweeting process in microblogs completely and accurately. New models that characterize information diffusion in microblogs therefore need to be designed and analyzed. Moreover, a comprehensive description of the correlation between information diffusion in microblogs and the search trends of keywords on search engines is lacking, although some work has found preliminary relationships.

This work presents a complete analysis of information diffusion in microblogs. The contributions and innovations of this thesis are as follows:

1) There are two popular unbiased Online Social Network (OSN) sampling algorithms: the Metropolis-Hastings Random Walk (MHRW) and the Unbiased Sampling for Directed Social Graph (USDSG) method. However, both are likely to yield considerable self-sampling probabilities when applied to microblogs where there is local. To solve this problem, I have modelled OSN sampling as a Markov process and derived the necessary and sufficient conditions for unbiased sampling. Based on these conditions, I propose an efficient and unbiased sampling algorithm, the Unbiased Sampling method with Dummy Edges (USDE), which strongly reduces the self-sampling probabilities of MHRW. The experimental evaluation demonstrates that the average node degree of the samples from MHRW and USDSG is 2 to 4 times as high as the ground truth, while USDE approximates the ground truth when sampling repetitions are removed. Moreover, the average sampling time per node in USDE is only half that of MHRW and USDSG.

2) The second contribution targets the shortcomings of the Independent Cascades (IC) and Linear Threshold (LT) models in characterizing the retweeting process in microblogs. I address them by introducing a Galton-Watson with Killing (GWK) model, which accounts for all three important factors: the ephemerality of information, the topology of the network, and the features of publishers and retweeters. I have validated the GWK model on two datasets, from Sina Weibo and Twitter, and shown that it can fit 82% of information receivers and 90% of the maximum numbers of hops in the real retweeting process. In addition, the GWK model is useful for revealing the endogenous and exogenous factors that affect the popularity of tweets.

3) Motivated by the correlation between the popularity and trendiness of topics in microblogs and search trends, I have developed an economic analysis of a market involving a third-party ad broker, which is a popular market in current search engine marketing (SEM). The analysis finds that augmenting adwords with trending and popular topics from Twitter enables the broker to achieve, on average, a return on investment four times larger than with a non-augmented strategy, while maintaining the same level of risk.
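To make the self-sampling issue raised in the first contribution concrete, the following is a minimal sketch of the textbook Metropolis-Hastings Random Walk on an undirected graph, not the thesis's implementation and not USDE. The function name `mhrw_sample`, the adjacency-dictionary representation, and the toy star graph are assumptions introduced here for illustration. A proposed move from node v to a uniformly chosen neighbour w is accepted with probability min(1, deg(v)/deg(w)); a rejected move leaves the walker on v, which is then sampled again.

```python
import random


def mhrw_sample(adjacency, start, num_steps, rng=None):
    """Textbook MHRW targeting the uniform distribution over nodes.

    adjacency: dict mapping each node to a list of its neighbours
               (undirected graph, hypothetical representation).
    Returns the list of sampled nodes and the number of rejected moves,
    i.e. steps where the walker re-sampled its current node.
    """
    rng = rng or random.Random()
    current = start
    samples = []
    self_samples = 0
    for _ in range(num_steps):
        neighbours = adjacency[current]
        candidate = rng.choice(neighbours)
        # Accept with probability min(1, deg(current) / deg(candidate)).
        accept_prob = min(1.0, len(neighbours) / len(adjacency[candidate]))
        if rng.random() < accept_prob:
            current = candidate
        else:
            self_samples += 1  # move rejected: current node is sampled again
        samples.append(current)
    return samples, self_samples


# Toy star graph (illustrative only): hub 0 connected to five leaves.
star = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0], 5: [0]}
samples, rejected = mhrw_sample(star, start=0, num_steps=1000, rng=random.Random(42))
print(f"self-sampling fraction: {rejected / len(samples):.2f}")
```

On this toy graph a walker sitting on a leaf (degree 1) proposes the hub (degree 5) and is accepted only with probability 1/5, so most steps re-sample the same leaf. The repeated samples are kept in the output because the uniform stationary distribution of the walk depends on them.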
