Segmenting User Sessions in Search Engine Query Logs Leveraging Word Embeddings

Segmenting user sessions in search engine query logs is important for understanding information needs and assessing how well they are satisfied, for improving the quality of search engine rankings, and for better targeting content to particular users. Most previous methods rely on human judgments to train supervised learning algorithms, and/or apply global thresholds on temporal proximity and on simple lexical similarity metrics. This paper proposes a novel unsupervised method that improves on the current state of the art by leveraging additional heuristics and similarity metrics derived from word embeddings. Specifically, we extend a previous approach that combines temporal and lexical similarity measurements, integrating semantic similarity components that use pre-trained FastText embeddings. The paper reports on experiments with an AOL query dataset used in previous studies, containing a total of 10,235 queries from 215 unique users, grouped into 4,253 sessions (an average of 2.4 queries per session). The results attest to the effectiveness of the proposed method, which outperforms a large set of baselines that are likewise unsupervised techniques.
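
As a rough illustration of the general idea (combining a temporal threshold with lexical and embedding-based similarity between consecutive queries), a minimal sketch could look as follows. It is not the method evaluated in the paper: the function names, the 30-minute gap, and the two similarity thresholds are assumptions chosen for the example, the decision rule is a simple hypothetical one, and a tiny hand-made vector table stands in for pre-trained FastText embeddings.

```python
import math
from datetime import datetime, timedelta


def char_ngrams(text, n=3):
    """Set of character n-grams of a space-padded, lowercased query."""
    padded = f" {text.lower().strip()} "
    return {padded[i:i + n] for i in range(max(len(padded) - n + 1, 1))}


def jaccard(a, b):
    """Lexical similarity: Jaccard coefficient over character n-grams."""
    ga, gb = char_ngrams(a), char_ngrams(b)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


def embed(query, vectors):
    """Average the vectors of known words; None if no word is known."""
    words = [w for w in query.lower().split() if w in vectors]
    if not words:
        return None
    dim = len(vectors[words[0]])
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]


def cosine(u, v):
    """Semantic similarity: cosine between two averaged query vectors."""
    if u is None or v is None:
        return 0.0
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def segment(log, vectors, max_gap=timedelta(minutes=30),
            lexical_min=0.3, semantic_min=0.7):
    """Split one user's time-ordered (timestamp, query) log into sessions.

    Assumed rule: a query continues the current session only if it is
    close in time to the previous query AND is either lexically or
    semantically similar to it. All thresholds are illustrative.
    """
    sessions = []
    for ts, query in log:
        if sessions:
            prev_ts, prev_query = sessions[-1][-1]
            close = ts - prev_ts <= max_gap
            related = (
                jaccard(prev_query, query) >= lexical_min
                or cosine(embed(prev_query, vectors),
                          embed(query, vectors)) >= semantic_min
            )
            if close and related:
                sessions[-1].append((ts, query))
                continue
        sessions.append([(ts, query)])
    return sessions


# Toy 2-d vectors standing in for pre-trained embeddings (made up).
toy_vectors = {
    "cheap": [0.9, 0.1], "flights": [0.8, 0.3], "airfare": [0.85, 0.25],
    "lisbon": [0.7, 0.4], "pasta": [0.1, 0.9], "recipe": [0.2, 0.95],
}

start = datetime(2006, 3, 1, 9, 0)
log = [
    (start, "cheap flights lisbon"),
    (start + timedelta(minutes=2), "airfare lisbon"),
    (start + timedelta(minutes=4), "pasta recipe"),
    (start + timedelta(hours=3), "pasta recipe easy"),
]

sessions = segment(log, toy_vectors)
for session in sessions:
    print([query for _, query in session])
```

In this toy log, the second query shares little surface text with the first but is kept with it through the embedding similarity, while the last query starts a new session purely because of the time gap. A real setup would load actual pre-trained vectors and tune or replace the thresholds.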
