Probabilistic Visitor Stitching on Cross-Device Web Logs

Personalization -- the customization of experiences, interfaces, and content to individual users -- has catalyzed user growth and engagement for many web services. A critical prerequisite to personalization is establishing user identity. However, users access web services from a variety of devices, including mobile phones, appliances, and smart watches, and in both anonymous and logged-in sessions, which poses a significant obstacle to user identification. The resulting entity resolution task of establishing user identity across devices and sessions is commonly referred to as ``visitor stitching.'' We introduce a general, probabilistic approach to visitor stitching that uses features and attributes commonly contained in web logs. Using web logs from two real-world corporate websites, we motivate the need for probabilistic models by quantifying the difficulties that noise, ambiguity, and missing information pose in deployment. Next, we introduce our approach, which is built on probabilistic soft logic (PSL), a statistical relational learning framework capable of capturing similarities across many sessions and enforcing transitivity. We present a detailed description of the model features and design choices relevant to the visitor stitching problem. Finally, we evaluate the binary classification performance of our PSL model on two real-world visitor stitching datasets. Our model performs significantly better than several state-of-the-art classifiers, and we show how this advantage results from collective reasoning across sessions.
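
To make the idea of enforcing transitivity concrete, the following is a minimal sketch of how a soft transitivity rule can be scored under the Lukasiewicz relaxation that PSL uses for logical connectives. It is an illustration only, not the model described in this paper: the predicate meaning (a soft "same user" score between sessions), the function names, and the default weight are all assumptions introduced here.

```python
# Illustrative sketch only: names and the example rule are assumptions,
# not the features or rules used in the paper's model.

def lukasiewicz_and(a: float, b: float) -> float:
    """Lukasiewicz relaxation of logical AND for truth values in [0, 1]."""
    return max(0.0, a + b - 1.0)


def distance_to_satisfaction(body: float, head: float) -> float:
    """Hinge penalty for a rule body -> head: positive only when body exceeds head."""
    return max(0.0, body - head)


def transitivity_penalty(same_ab: float, same_bc: float, same_ac: float,
                         weight: float = 1.0) -> float:
    """Weighted penalty for the hypothetical rule
    SameUser(A, B) & SameUser(B, C) -> SameUser(A, C),
    where each argument is a soft truth value in [0, 1]."""
    body = lukasiewicz_and(same_ab, same_bc)
    return weight * distance_to_satisfaction(body, same_ac)


if __name__ == "__main__":
    # Sessions A-B and B-C look alike, but A-C is scored low: the rule is violated.
    print(transitivity_penalty(0.9, 0.8, 0.2))
    # A consistent assignment incurs no penalty.
    print(transitivity_penalty(0.9, 0.8, 0.9))
```

Because the penalty depends jointly on three pairwise scores, lowering it requires reasoning about session pairs together rather than one pair at a time, which is the intuition behind collective reasoning across sessions.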
