G-WSTD: a framework for geographic web search topic discovery

Search engine query logs are an important information source that captures the interests and information needs of millions of users. In this paper, we tackle the problem of discovering latent geographic search topics by mining search engine query logs. We develop a novel framework, G-WSTD, which comprises search session derivation, geographic information extraction, and geographic search topic discovery, to support a variety of downstream web applications. The core components of the framework are two topic models that discover geographic search topics from two different perspectives. The first is the Discrete Search Topic Model (DSTM), which aims to capture the semantic commonalities across discrete geographic locations. The second is the Regional Search Topic Model (RSTM), which focuses on a specific region of the map and discovers web search topics that exhibit geographic locality. We evaluate the framework against several strong baselines on a real-life query log. In these experiments, it demonstrates improved data interpretability, better prediction performance, and higher topic distinctiveness. Its effectiveness is further verified through applications such as user profiling and URL annotation.
