Learning Hostname Preference to Enhance Search Relevance

Hostnames such as en.wikipedia.org and www.amazon.com are strong indicators of the content they host. The hostnames relevant to a query can therefore serve as a signature that captures the query intent. In this study, we learn the hostname preference of queries and use it to enhance search relevance. Implicit and explicit query intent are modeled simultaneously by a feature-aware matrix completion framework. A block-wise parallel algorithm is developed on top of Spark MLlib for fast optimization of feature-aware matrix completion. The optimization completes within minutes at the scale of a million × million matrix, which enables efficient experimental studies at web scale. The learned hostname preference is evaluated both intrinsically, on test errors, and extrinsically, on its impact on search ranking relevance. Experimental results demonstrate that capturing hostname preference can significantly boost retrieval performance.
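
To make the idea of feature-aware matrix completion more concrete, the following is a minimal single-machine sketch, not the block-wise Spark algorithm described above. It assumes one common formulation in which the predicted preference of query q for hostname h is the sum of a latent-factor term (implicit intent) and a linear term over explicit query features. The exact model, the loss, the feature names, the toy data, and all function names (predict, train, rmse) are assumptions made for illustration and are not taken from the study.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def predict(U, V, W, feats, q, h):
    # Latent term (implicit intent) plus feature term (explicit intent).
    return dot(U[q], V[h]) + dot(W[h], feats[q])

def train(observed, feats, n_queries, n_hosts,
          rank=2, lr=0.05, reg=0.01, epochs=500, seed=0):
    """Plain SGD on squared error with L2 regularization (assumed objective)."""
    rng = random.Random(seed)
    n_feats = len(feats[0])
    U = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_queries)]
    V = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_hosts)]
    W = [[0.0] * n_feats for _ in range(n_hosts)]
    for _ in range(epochs):
        for q, h, y in observed:
            err = predict(U, V, W, feats, q, h) - y
            u_old = U[q][:]  # keep the pre-update query factors for V's gradient
            for k in range(rank):
                U[q][k] -= lr * (err * V[h][k] + reg * U[q][k])
                V[h][k] -= lr * (err * u_old[k] + reg * V[h][k])
            for j in range(n_feats):
                W[h][j] -= lr * (err * feats[q][j] + reg * W[h][j])
    return U, V, W

def rmse(U, V, W, feats, entries):
    se = sum((predict(U, V, W, feats, q, h) - y) ** 2 for q, h, y in entries)
    return (se / len(entries)) ** 0.5

# Hypothetical toy data: 4 queries, 2 hostnames.
hosts = ["en.wikipedia.org", "www.amazon.com"]
# Hypothetical explicit query features: [informational, shopping].
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
# Observed (query, host, preference) entries; all other cells are missing.
observed = [(0, 0, 1.0), (0, 1, 0.0), (1, 0, 1.0),
            (2, 1, 1.0), (2, 0, 0.0), (3, 1, 1.0)]

U, V, W = train(observed, feats, n_queries=4, n_hosts=2)
train_error = rmse(U, V, W, feats, observed)
# Query 1 was never observed with www.amazon.com; the feature term
# lets the model fill in that missing cell.
unseen_score = predict(U, V, W, feats, 1, 1)
seen_score = predict(U, V, W, feats, 1, 0)
```

In this toy setting the shared query features carry most of the signal, so the missing cell for query 1 is completed from what was observed for query 0. At web scale the same kind of objective would need a distributed solver, which is the role of the block-wise parallel algorithm in the study.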
