Partitioned multi-indexing: bringing order to social search

To answer search queries on a social network rich with user-generated content, it is desirable to rank content higher when it is closer to the individual issuing the query. Queries are issued at nodes in the network, documents are created by nodes in the same network, and the goal is to find the document that matches the query and is closest in network distance to the querying node. In this paper, we present the "Partitioned Multi-Indexing" scheme, which provides an approximate solution to this problem. With m links in the network, after an offline pre-processing step that takes ~O(m) time, our scheme supports every social index operation (social search queries, as well as insertion and deletion of words into and from a document at any node) in ~O(1) time. Further, our scheme can be implemented on open source distributed streaming systems such as Yahoo! S4 or Twitter's Storm so that every social index operation takes ~O(1) processing time and ~O(1) network queries in the worst case, and just two network queries in the common case where the reverse index corresponding to the query keyword is much smaller than the memory available at any distributed compute node. Because the scheme builds on Das Sarma et al.'s approximate distance oracle, its worst-case approximation ratio is ~O(1) for undirected networks. Our simulations on the social network Twitter as well as on synthetic networks show that, in practice, the approximation ratio is close to 1 for both directed and undirected networks. We believe that this work is the first demonstration of the feasibility of social search with real-time text updates at large scales.
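To make the problem statement concrete, the following is a minimal sketch of socially ranked keyword search on a small undirected graph. It is not the Partitioned Multi-Indexing scheme itself. It assumes a seed-set distance sketch in the general style of a sketch-based distance oracle (random seed sets of sizes 1, 2, 4, ..., with each node storing its nearest seed in each set), and it answers a query by scanning the whole reverse index for the keyword, which is exactly the cost the scheme is designed to avoid. All names (`build_sketches`, `estimate_distance`, `social_search`, `reverse_index`) and the `rounds` parameter are hypothetical and chosen for illustration.

```python
import random
from collections import deque


def multi_source_bfs(adj, seeds):
    """Map each reachable node to (nearest_seed, hop_distance_to_that_seed)."""
    nearest = {s: (s, 0) for s in seeds}
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        seed, d = nearest[u]
        for v in adj[u]:
            if v not in nearest:
                nearest[v] = (seed, d + 1)
                queue.append(v)
    return nearest


def build_sketches(adj, rounds=3, rng=None):
    """Offline step (assumed design): sample seed sets of doubling size and
    record, for every node, the exact distance to its nearest seed in each set."""
    rng = rng or random.Random(0)
    nodes = sorted(adj)
    sketches = {u: {} for u in nodes}
    for _ in range(rounds):
        size = 1
        while size <= len(nodes):
            seeds = rng.sample(nodes, size)
            for u, (seed, d) in multi_source_bfs(adj, seeds).items():
                sketches[u][seed] = d
            size *= 2
    return sketches


def estimate_distance(sketches, u, v):
    """Upper bound on d(u, v) in an undirected graph: route through the best
    seed that appears in both sketches (triangle inequality)."""
    if u == v:
        return 0
    common = sketches[u].keys() & sketches[v].keys()
    if not common:
        return float("inf")
    return min(sketches[u][w] + sketches[v][w] for w in common)


def social_search(sketches, reverse_index, query_node, keyword):
    """Return the node holding a matching document with the smallest estimated
    distance to query_node, or None if no document contains the keyword.
    This is a brute-force scan over the keyword's reverse index."""
    candidates = reverse_index.get(keyword, ())
    return min(
        candidates,
        key=lambda n: estimate_distance(sketches, query_node, n),
        default=None,
    )
```

For example, with `adj` as an adjacency-list dictionary and `reverse_index = {"jazz": {2, 5}}`, calling `social_search(build_sketches(adj), reverse_index, 0, "jazz")` returns whichever of nodes 2 and 5 has the smaller estimated distance to node 0. Because the estimate is only an upper bound on the true distance, the returned node is approximately, not necessarily exactly, the closest match.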
