Context sensitive stemming for web search

Traditionally, stemming has been applied to Information Retrieval tasks by transforming words in documents to their root form before indexing, and applying a similar transformation to query terms. Although it increases recall, this naive strategy does not work well for Web Search, since it lowers precision and requires a significant amount of additional computation. In this paper, we propose a context sensitive stemming method that addresses these two issues. Two unique properties make our approach feasible for Web Search. First, based on statistical language modeling, we perform context sensitive analysis on the query side. Before submitting the query to the search engine, we accurately predict which morphological variants of a query term are useful for expanding it. This dramatically reduces the number of bad expansions, which in turn reduces the cost of additional computation and improves precision at the same time. Second, our approach performs context sensitive document matching for those expanded variants. This conservative strategy serves as a safeguard against spurious stemming, and it turns out to be very important for improving precision. Using word pluralization handling as an example of our stemming approach, our experiments on a major Web search engine show that, by stemming only 29% of the query traffic, we can improve relevance as measured by average Discounted Cumulative Gain (DCG5) by 6.1% on these queries and 1.8% over all query traffic.
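The query-side step can be illustrated with a minimal sketch. It is not the method evaluated in the paper: it assumes a toy add-one-smoothed bigram language model trained on a handful of made-up queries, and it keeps a plural or singular variant only when substituting it leaves the query at least half as probable as the original. The corpus, the function names (train_bigram, log_prob, select_variants), and the margin are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for a large query log or web text.
CORPUS = [
    "hotel price comparison",
    "hotel price comparison",
    "hotel prices comparison",
    "cheap hotel prices",
    "book a hotel",
    "hotels in paris",
    "cheap hotels in paris",
]


def train_bigram(sentences):
    """Count unigrams and bigrams, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(words, unigrams, bigrams):
    """Log-probability of a word sequence under add-one smoothed bigrams."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + list(words) + ["</s>"]
    return sum(
        math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
        for prev, cur in zip(tokens, tokens[1:])
    )


def select_variants(query, position, candidates, unigrams, bigrams,
                    margin=math.log(2)):
    """Keep variants whose substituted query scores within `margin`
    (in log space) of the original query. The margin is an assumed,
    illustrative threshold."""
    words = query.split()
    base = log_prob(words, unigrams, bigrams)
    kept = []
    for variant in candidates:
        substituted = words[:position] + [variant] + words[position + 1:]
        if log_prob(substituted, unigrams, bigrams) >= base - margin:
            kept.append(variant)
    return kept


unigrams, bigrams = train_bigram(CORPUS)
query = "hotel price comparison"
# "prices" fits the surrounding context, so it is kept as an expansion.
price_variants = select_variants(query, 1, ["prices"], unigrams, bigrams)
# "hotels" is a poor fit before "price" in this toy data, so it is dropped.
hotel_variants = select_variants(query, 0, ["hotels"], unigrams, bigrams)
print(price_variants, hotel_variants)
```

On this toy data the sketch expands "price" to "prices" but leaves "hotel" unexpanded, which mirrors the idea that the same stemming rule is applied to some query terms and withheld from others depending on context. A production system would use a far larger corpus and a better smoothing scheme than add-one.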
