GeoCUTS: Geographic Clustering Using Travel Statistics

Web-based services often run experiments to improve their products. To carry out an effective experiment and evaluate the results appropriately, there must be a control group and at least one treatment group. Ideally, all of these groups are disjoint, so that each user receives exactly one treatment. Using geographical locations as units of experimentation is desirable because it does not require tracking individual users or browser cookies. However, with the popularity of mobile devices, a single user may issue queries from multiple geographical locations. To serve as units of experimentation, geographical partitions should therefore be chosen to reduce transit between regions. The strategy of clustering users by region is common in advertising, and designated marketing areas (DMAs) are specifically designed for this purpose. However, DMAs are restricted to the US, and their granularity is inflexible (there are around two hundred in total). Moreover, they are built based on population density (one DMA per metropolitan area) rather than on mobile movement patterns. In this paper, we present GeoCUTS, an algorithm that forms geographical clusters to minimize movement between clusters while preserving rough balance in cluster size. We use a random sample of anonymized mobile user traffic to form a graph representing user movements, then construct a geographically coherent clustering of the graph. We propose a statistical framework to measure the effectiveness of clusterings and perform empirical evaluations showing that the performance of GeoCUTS is comparable to that of hand-crafted DMAs with respect to both novel and existing metrics. GeoCUTS offers a general and flexible framework for conducting geo-based experiments in any part of the world.
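To make the idea of a movement graph concrete, the following is a minimal Python sketch and not the GeoCUTS implementation. It assumes each observation is a hypothetical (user, cell) pair, weights an edge between two cells by the number of users seen in both, and scores a given clustering by the share of edge weight that crosses cluster boundaries. The function names, the edge weighting, and the score are illustrative assumptions; the abstract does not specify the actual graph construction, clustering procedure, or evaluation metrics.

```python
from collections import defaultdict
from itertools import combinations


def build_movement_graph(visits):
    """Build an undirected weighted graph from (user_id, cell_id) pairs.

    Assumption for illustration: the weight of edge (a, b) is the number
    of distinct users observed in both cell a and cell b.
    """
    cells_by_user = defaultdict(set)
    for user, cell in visits:
        cells_by_user[user].add(cell)

    weights = defaultdict(int)
    for cells in cells_by_user.values():
        # Sort so that each undirected edge has one canonical key.
        for a, b in combinations(sorted(cells), 2):
            weights[(a, b)] += 1
    return dict(weights)


def cut_fraction(weights, cluster_of):
    """Share of total edge weight whose endpoints lie in different clusters."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    cut = sum(w for (a, b), w in weights.items()
              if cluster_of[a] != cluster_of[b])
    return cut / total


# Toy data: three users, three cells.
visits = [
    ("u1", "A"), ("u1", "B"),
    ("u2", "B"), ("u2", "C"),
    ("u3", "A"), ("u3", "B"),
]
weights = build_movement_graph(visits)

# A hypothetical clustering that groups A with B and isolates C.
cluster_of = {"A": 0, "B": 0, "C": 1}
fraction = cut_fraction(weights, cluster_of)
print(weights, fraction)
```

A lower cut fraction means fewer users move between clusters, which is the property the abstract asks of a good geographical partition. This score ignores the balance requirement on cluster size, which would need a separate check.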
