Subtopic mining using simple patterns and hierarchical structure of subtopic candidates from web documents

We use only a web document collection instead of query logs and external resources. Our simple patterns are based on noun phrases and alternative partial-queries. We maintain a balance between the popularity and diversity of subtopics. Our method covers the various search intents of a query with only a few subtopics. Our results improve steadily as more relevant and diverse subtopics are extracted.

The intention gap between users and queries results in ambiguous and broad queries. To address this problem, subtopic mining has been studied; it returns a ranked list of possible subtopics for a query according to their relevance, popularity, and diversity. This paper proposes a novel method to mine subtopics using simple patterns and a hierarchical structure of subtopic candidates. First, relevant and diverse phrases are extracted as subtopic candidates using simple patterns based on noun phrases and alternative partial-queries. Second, a hierarchical structure of the subtopic candidates is constructed using sets of relevant documents from a web document collection. Finally, the subtopic candidates are ranked using this structure, balancing popularity against diversity. In experiments, the proposed methods outperformed the baselines, and even a method based on external resources, on the top-ranked subtopics, which suggests that they can be effective and useful in search scenarios such as result diversification.
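The second and third steps can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not specify how the hierarchy is built or how candidates are scored, so the sketch assumes that a candidate is a child of the smallest candidate whose relevant-document set strictly contains its own, and that ranking is a greedy pass mixing popularity (document-set size) with novelty (documents not yet covered). The function names, the `weight` parameter, and the example data are all hypothetical.

```python
from typing import Dict, List, Optional, Set


def build_hierarchy(docs: Dict[str, Set[int]]) -> Dict[str, Optional[str]]:
    """Assumed rule: the parent of a candidate is the smallest candidate whose
    document set strictly contains its own (None for top-level candidates)."""
    parents: Dict[str, Optional[str]] = {}
    for child, child_docs in docs.items():
        supersets = [c for c, d in docs.items() if child_docs < d]
        parents[child] = min(
            supersets, key=lambda c: (len(docs[c]), c), default=None
        )
    return parents


def rank_subtopics(docs: Dict[str, Set[int]], k: int, weight: float = 0.5) -> List[str]:
    """Assumed greedy ranking: score = weight * popularity + (1 - weight) * novelty."""
    total = len(set().union(*docs.values())) or 1
    covered: Set[int] = set()
    ranked: List[str] = []
    remaining = set(docs)
    while remaining and len(ranked) < k:
        def score(candidate: str) -> float:
            popularity = len(docs[candidate]) / total
            novelty = len(docs[candidate] - covered) / total
            return weight * popularity + (1 - weight) * novelty

        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining), key=score)
        ranked.append(best)
        covered |= docs[best]
        remaining.remove(best)
    return ranked


# Hypothetical candidates for the query "jaguar", each mapped to the IDs of
# the documents assumed to be relevant to it.
candidate_docs = {
    "jaguar car": {1, 2, 3, 4},
    "jaguar car price": {1, 2},
    "jaguar animal": {5, 6, 7},
    "jaguar os": {8, 9},
}

hierarchy = build_hierarchy(candidate_docs)
top3 = rank_subtopics(candidate_docs, k=3)
print(hierarchy)
print(top3)
```

In this toy example the nested candidate "jaguar car price" is pushed down the ranking once its parent has been selected, because its documents are already covered; that is one simple way a hierarchy over document sets can trade popularity against diversity.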
