Session stitching using sequence fingerprinting for web page visits

Abstract The nature of people's web navigation has changed significantly in recent years. The advent of smartphones and other handheld devices means that web users now consult websites from more than one device, or from a shared device. As a result, large volumes of seemingly disjoint data are available which, when analysed together, can support decision-making. Identifying web sessions by linking such data back to a specific person, however, is hard. Session stitching aims to overcome this by using machine learning inference to identify similar or identical users. Many such efforts use demographic data or device-based features to train matching algorithms. However, these variables are often not available for every dataset or are recorded differently, making a streamlined setup difficult. Besides, they often result in vast feature spaces which are hard to interpret in an actionable way. In this paper, we present an alternative approach based on fingerprinting the web pages visited by users in a single session. By learning behavioural patterns from these sequences of page visits, we obtain features that can be used for matching without requiring sensitive user-agent data such as IP address, geolocation, or device details, as is common with other approaches. These sequential fingerprints do not rely on predefined features and only require the recording of web page visits, making our approach actionable. The approach is empirically tested on real-life web logs and compared with matching using regular user-agent features and state-of-the-art embedding techniques. Results in an e-commerce context show that sequential features can still obtain strong performance with fewer features, facilitating decision-making on session stitching and informing subsequent related activities such as marketing or customer analysis.
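
As a rough illustration of the general idea of matching sessions on page-visit sequences alone, the sketch below builds a simple fingerprint from contiguous page n-grams and compares two sessions with a weighted Jaccard similarity. This is not the method evaluated in the paper: the n-gram fingerprint, the similarity measure, the function names `fingerprint` and `similarity`, and the example page labels are all assumptions chosen for brevity.

```python
from collections import Counter


def fingerprint(pages, n=2):
    """Count contiguous n-grams of page visits in one session.

    Hypothetical helper: an n-gram count is only one simple way to
    summarise a visit sequence; the paper's features may differ.
    """
    return Counter(tuple(pages[i:i + n]) for i in range(len(pages) - n + 1))


def similarity(fp_a, fp_b):
    """Weighted Jaccard similarity between two n-gram count fingerprints."""
    keys = set(fp_a) | set(fp_b)
    if not keys:
        return 0.0
    # Counter returns 0 for missing keys, so min/max work across both sets.
    inter = sum(min(fp_a[k], fp_b[k]) for k in keys)
    union = sum(max(fp_a[k], fp_b[k]) for k in keys)
    return inter / union


# Illustrative sessions (made-up page labels, not real log data).
session_a = ["home", "search", "product", "cart"]
session_b = ["home", "search", "product", "checkout"]
session_c = ["blog", "about", "contact"]

fp_a, fp_b, fp_c = (fingerprint(s) for s in (session_a, session_b, session_c))

print(similarity(fp_a, fp_b))  # sessions sharing two of four distinct bigrams
print(similarity(fp_a, fp_c))  # sessions with no bigram in common
```

A score like this could serve as one input feature to a matching classifier; it uses only the recorded page visits and no IP, location, or device information.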
