Discovering Similar Users on Twitter

This work studies the problem of discovering "similar" users at Twitter, where we define two users to be similar if they produce similar content. The discovery of the top similar accounts for each Twitter user has a variety of applications at Twitter, including user recommendations and advertiser targeting. Although the discovery of similar results is a well-studied problem in information retrieval, the particular problem at Twitter poses three novel challenges. The first is the heterogeneous mix of signals that an effective technique could use: content analysis, social graph structure, user popularity, user interaction data, and so on. It is a priori unclear how one could blend all these different signals in an effective manner. Second, any technique needs to work effectively with very different kinds of users. This implies that the same framework needs to be applicable both to users with millions of followers and to users with very few or no followers. Finally, any proposed technique needs to scale so that it can discover similar users for hundreds of millions of users, while keeping up with the highly dynamic nature of all the input signals. In this work, we share the machine-learning-based framework that we use to discover similar users at Twitter. We also evaluate the effectiveness of the framework on Twitter data.
