A Bipartite Graph-Based Ranking Approach to Query Subtopics Diversification Focused on Word Embedding Features

Web search queries are often vague or ambiguous, or carry multiple intents, and different users may have different intents when issuing the same query. Understanding these intents by mining the subtopics underlying a query has attracted much interest in recent years. Query suggestions provided by search engines capture some intents of the original query; however, suggested queries are often noisy and contain groups of alternative queries with similar meaning. Identifying the subtopics that cover the possible intents behind a query is therefore a formidable task. Moreover, because both the query and its subtopics are short, it is challenging to estimate the similarity between a pair of short texts and rank them accordingly. In this paper, we propose a method for mining and ranking subtopics that introduces multiple semantic and content-aware features, a bipartite graph-based ranking (BGR) method, and a similarity function for short texts. Given a query, we aggregate the queries suggested by search engines as candidate subtopics and estimate their relevance to the query from word embedding and content-aware features by modeling a bipartite graph. To estimate the similarity between two short texts, we propose a Jensen-Shannon divergence-based similarity function over the probability distributions of the terms in the top documents retrieved from a search engine. A diversified ranked list of subtopics covering the possible intents of a query is assembled by balancing relevance and novelty. We evaluated our method on the NTCIR-10 INTENT-2 and NTCIR-12 IMINE-2 subtopic mining test collections. Our proposed method outperforms the baselines, known related methods, and the official participants of the INTENT-2 and IMINE-2 competitions.

Key words: Subtopic Mining, Query Intent, Diversification, Word Embedding, Bipartite Graph
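To make the two ranking ingredients named in the abstract concrete, the following is a minimal sketch, not the paper's implementation. It assumes that each short text is represented by the maximum-likelihood term distribution of its top retrieved documents, that similarity is taken as one minus the base-2 Jensen-Shannon divergence, and that relevance and novelty are balanced with a greedy, MMR-style rule. The function names, the whitespace tokenizer, the `1 - JSD` mapping, and the trade-off parameter `lam` are all illustrative assumptions; the abstract does not specify them.

```python
import math
from collections import Counter


def term_distribution(docs):
    """Maximum-likelihood term distribution over a list of document strings.

    Assumption: lowercase whitespace tokenization, no smoothing.
    """
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, so the value lies in [0, 1]."""
    vocab = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2 over the joint vocabulary.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}

    def kl(a):
        # KL(a || m); terms with zero probability in a contribute nothing.
        return sum(pa * math.log2(pa / m[t]) for t, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def js_similarity(docs_a, docs_b):
    """Assumed mapping from divergence to similarity: 1 - JSD."""
    return 1.0 - js_divergence(term_distribution(docs_a), term_distribution(docs_b))


def diversify(relevance, similarity, k, lam=0.7):
    """Greedy MMR-style selection (an assumed stand-in for the paper's rule).

    relevance:  dict mapping each candidate subtopic to a relevance score
    similarity: callable (a, b) -> similarity between two candidates
    Each step picks the candidate maximizing
        lam * relevance - (1 - lam) * max similarity to already-selected items.
    """
    selected = []
    remaining = set(relevance)
    while remaining and len(selected) < k:
        def score(c):
            redundancy = max((similarity(c, s) for s in selected), default=0.0)
            return lam * relevance[c] - (1 - lam) * redundancy

        # sorted() makes tie-breaking deterministic.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

In this sketch, `js_similarity` could be passed to `diversify` after binding each candidate subtopic to its retrieved documents, so that a candidate whose documents closely resemble those of an already selected subtopic is penalized as less novel.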
