Reduce and aggregate: similarity ranking in multi-categorical bipartite graphs

We study the problem of computing similarity rankings in large-scale multi-categorical bipartite graphs, where the two sides of the graph represent actors and items, and the items are partitioned into an arbitrary set of categories. The problem has several real-world applications, including identifying competing advertisers and suggesting related queries in an online advertising system, or finding users with similar interests and suggesting content to them. In these settings, we are interested in computing on-the-fly rankings of similar actors, given an actor and an arbitrary subset of categories of interest. Two main challenges arise. First, the bipartite graphs are huge and often lopsided (e.g., the system might receive billions of queries while presenting only millions of advertisers). Second, the sheer number of possible combinations of categories prevents pre-computing the results for all of them. We present a novel algorithmic framework that addresses both issues for the computation of several graph-theoretical similarity measures, including the number of common neighbors and Personalized PageRank. We show how to exploit the imbalance in the graphs to speed up the computation, and we provide efficient real-time algorithms for computing rankings for an arbitrary subset of categories. Finally, we experimentally demonstrate the accuracy of our approach on real-world data, using both public graphs and a very large dataset from Google AdWords.
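To make the query setting concrete, here is a minimal Python sketch for the simplest of the measures named above, the number of common neighbors. It is not the paper's algorithm. It only illustrates one observation: because each item belongs to exactly one category, the common-neighbor count of two actors over a subset of categories is the sum of their per-category counts, so per-category tables can be built once and combined at query time. The function names (`per_category_common_neighbors`, `rank_similar`) and the toy advertiser/query data are assumptions made for this illustration.

```python
from collections import defaultdict
from itertools import combinations


def per_category_common_neighbors(edges, item_category):
    """Precompute, for each category, how many items each actor pair shares.

    edges: iterable of (actor, item) pairs.
    item_category: dict mapping every item to its single category.
    Returns {category: {(a, b): count}} with a < b.
    """
    actors_of_item = defaultdict(set)
    for actor, item in edges:
        actors_of_item[item].add(actor)

    counts = defaultdict(lambda: defaultdict(int))
    for item, actors in actors_of_item.items():
        category = item_category[item]
        # Every pair of actors adjacent to this item gains one common neighbor.
        for a, b in combinations(sorted(actors), 2):
            counts[category][(a, b)] += 1
    return counts


def rank_similar(counts, actor, categories):
    """Rank other actors by common neighbors restricted to `categories`."""
    scores = defaultdict(int)
    for category in categories:
        for (a, b), n in counts.get(category, {}).items():
            if a == actor:
                scores[b] += n
            elif b == actor:
                scores[a] += n
    # Highest score first; ties broken by actor name for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy data: advertisers (actors) linked to queries (items).
edges = [
    ("adv1", "q1"), ("adv2", "q1"),
    ("adv1", "q2"), ("adv2", "q2"), ("adv3", "q2"),
    ("adv1", "q3"), ("adv3", "q3"),
]
item_category = {"q1": "shoes", "q2": "shoes", "q3": "travel"}

counts = per_category_common_neighbors(edges, item_category)
print(rank_similar(counts, "adv1", ["shoes"]))
print(rank_similar(counts, "adv1", ["shoes", "travel"]))
```

The pairwise enumeration in this sketch is quadratic in the number of actors adjacent to an item, so it is only suitable for toy inputs. Personalized PageRank is not additive over categories in this simple way and would need a different aggregation step.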
