Ranking in Genealogy: Search Results Fusion at Ancestry

Genealogy research is the study of family history using available resources such as historical records. Ancestry provides its customers with one of the world's largest online genealogical indexes, with billions of records from a wide range of sources, including vital records such as birth and death certificates, census records, and court and probate records, among many others. Search at Ancestry aims to return relevant records from various record types, allowing our subscribers to build their family trees, research their family history, and make meaningful discoveries about their ancestors from diverse perspectives. In a modern search engine designed for genealogical study, ranking search results so that the most relevant information appears first is a daunting challenge. In particular, the disparity among historical records makes it inherently difficult to score records from different record types in an equitable fashion. Herein, we provide an overview of our solutions to such record disparity problems in the Ancestry search engine. Specifically, we introduce customized coordinate ascent (customized CA) to speed up ranking within a specific record type. We then propose stochastic search (SS), which linearly combines ranked results federated across contents from various record types. Furthermore, we propose a novel information retrieval metric, normalized cumulative entropy (NCE), to measure the diversity of results. We demonstrate the effectiveness of these two algorithms in terms of relevance (measured by NDCG) and, where applicable, diversity (measured by NCE) in offline experiments using real customer data at Ancestry.
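To make the idea of linearly combining ranked results across record types concrete, the following is a minimal Python sketch, not the method described above. It assumes each record type returns its own list of (record id, score) pairs, and it applies one hand-picked weight per record type before merging; the weights, record ids, relevance labels, and the function names `fuse`, `dcg`, and `ndcg` are all hypothetical. In the approach described above the combination weights would be found by stochastic search rather than set by hand, and NCE is not sketched here because its definition is not given in this abstract. NDCG is computed with the common exponential-gain form, with the ideal ordering taken over the fused list only.

```python
import math
from typing import Dict, List, Tuple

def fuse(results_by_type: Dict[str, List[Tuple[str, float]]],
         weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Scale each record type's scores by its weight, then merge and sort."""
    fused = [(rid, weights[rtype] * score)
             for rtype, records in results_by_type.items()
             for rid, score in records]
    return sorted(fused, key=lambda pair: pair[1], reverse=True)

def dcg(gains: List[int]) -> float:
    """Discounted cumulative gain with exponential gain and log2 discount."""
    return sum((2 ** g - 1) / math.log2(pos + 2) for pos, g in enumerate(gains))

def ndcg(gains: List[int]) -> float:
    """DCG of the given order divided by DCG of the ideal order."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Hypothetical per-record-type results and relevance labels.
results = {
    "census": [("c1", 0.9), ("c2", 0.4)],
    "birth": [("b1", 0.5)],
}
labels = {"c1": 1, "c2": 0, "b1": 3}

def evaluate(weights: Dict[str, float]) -> float:
    ranking = fuse(results, weights)
    return ndcg([labels[rid] for rid, _ in ranking])

equal_weights = evaluate({"census": 1.0, "birth": 1.0})
tuned_weights = evaluate({"census": 1.0, "birth": 2.0})
print(equal_weights, tuned_weights)
```

With equal weights the highly relevant birth record is ranked below a less relevant census record because raw scores from different record types are not directly comparable; raising the weight on the birth record type moves it to the top and increases NDCG. This is the kind of per-record-type weight that a search over the weight space would tune.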
