Learning to Rank Homepages For Researcher-Name Queries

Researchers constitute the ‘kernel’ entities in scientific digital library portals such as CiteSeerX, DBLP and Academic Search. In addition to the primary tasks of searching and browsing research literature, digital libraries support other tasks such as expertise modeling of researchers and social network analysis involving researcher entities. Most of the information required for these tasks must be extracted from the publications associated with a particular researcher, along with the information researchers provide on their homepages. To enable the collection of these homepages, we study the retrieval of researcher homepages from the Web using queries based on researcher names. We posit that researcher homepages are characterized by specific content-based and structural features that can be effectively harnessed to identify them. We use topic modeling to identify discipline-independent features while learning the ranking function for homepage retrieval. On a large dataset based on researcher names from DBLP, our ranking function increases the success rate from 3.2% to 21.28% at rank 1 and from 29.6% to 66.3% at rank 10 over a baseline retrieval model that uses a similarity function based on query-content match. We also obtain modest performance with our small set of about 125 discipline-independent features on identifying the researcher homepages in the WebKB dataset.
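The abstract reports results as success rate at rank 1 and rank 10 without defining the measure. The sketch below shows one common reading: the fraction of name queries for which at least one correct homepage appears among the top k results. This definition is an assumption, as are the function name `success_rate_at_k`, the dictionary layout, and the toy queries and URLs; none of them come from the paper.

```python
from typing import Dict, List, Set


def success_rate_at_k(
    rankings: Dict[str, List[str]],
    homepages: Dict[str, Set[str]],
    k: int,
) -> float:
    """Fraction of queries with at least one correct homepage in the top k.

    rankings:  query -> ranked list of result URLs (best first)
    homepages: query -> set of URLs judged to be the researcher's homepage
    """
    if not rankings:
        return 0.0
    hits = sum(
        1
        for query, ranked_urls in rankings.items()
        if any(url in homepages.get(query, set()) for url in ranked_urls[:k])
    )
    return hits / len(rankings)


# Toy data (hypothetical): two researcher-name queries with three results each.
rankings = {
    "researcher a": ["a.example/home", "a.example/pubs", "other.example"],
    "researcher b": ["news.example", "dir.example", "b.example/home"],
}
homepages = {
    "researcher a": {"a.example/home"},
    "researcher b": {"b.example/home"},
}

rate_at_1 = success_rate_at_k(rankings, homepages, k=1)
rate_at_3 = success_rate_at_k(rankings, homepages, k=3)
```

In this toy case only the first query has its homepage at rank 1, while both queries have it within the top 3, so the measure rises as k grows, which is the pattern the reported rank 1 and rank 10 figures follow.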