Click-words: learning to predict document keywords from a user perspective

Motivation: Recognizing words that are key to a document is important for ranking relevant scientific documents. Traditionally, important words in a document are either nominated subjectively by authors and indexers or selected objectively by some statistical measures. As an alternative, we propose to use documents' words popularity in user queries to identify click-words, a set of prominent words from the users' perspective. Although they often overlap, click-words differ significantly from other document keywords. Results: We developed a machine learning approach to learn the unique characteristics of click-words. Each word was represented by a set of features that included different types of information, such as semantic type, part of speech tag, term frequency–inverse document frequency (TF–IDF) weight and location in the abstract. We identified the most important features and evaluated our model using 6 months of PubMed click-through logs. Our results suggest that, in addition to carrying high TF–IDF weight, click-words tend to be biomedical entities, to exist in article titles, and to occur repeatedly in article abstracts. Given the abstract and title of a document, we are able to accurately predict the words likely to appear in user queries that lead to document clicks. Contact: luzh@ncbi.nlm.nih.gov Supplementary information: Supplementary data are available at Bioinformatics online.

[1]  Miguel A. Andrade-Navarro,et al.  Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families , 1998, Bioinform..

[2]  Tong Zhang,et al.  Solving large scale linear prediction problems using stochastic gradient descent algorithms , 2004, ICML.

[3]  David Hawking,et al.  Improving Rankings in Small-Scale Web Search Using Click Implied Descriptions , 2006, Aust. J. Intell. Inf. Process. Syst..

[4]  Po-Ting Lai,et al.  PubMed-EX: a web browser extension to enhance PubMed search with text mining features , 2009, Bioinform..

[5]  Benjamin Piwowarski,et al.  A user browsing model to predict search engine click data from past observations. , 2008, SIGIR '08.

[6]  Weiguo Fan,et al.  Learning to advertise , 2006, SIGIR.

[7]  Zhiyong Lu,et al.  Understanding PubMed® user search behavior through log analysis , 2009, Database J. Biol. Databases Curation.

[8]  Mark Last,et al.  Graph-Based Keyword Extraction for Single-Document Summarization , 2008, COLING 2008.

[9]  Ariel Fuxman,et al.  Using the wisdom of the crowds for keyword generation , 2008, WWW.

[10]  Mounia Lalmas,et al.  SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval , 2006 .

[11]  Shamkant B. Navathe,et al.  Comparison of two schemes for automatic keyword extraction from MEDLINE for functional gene clustering , 2004, Proceedings. 2004 IEEE Computational Systems Bioinformatics Conference, 2004. CSB 2004..

[12]  Joshua Goodman,et al.  Finding advertising keywords on web pages , 2006, WWW '06.

[13]  Alan R. Aronson,et al.  Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program , 2001, AMIA.

[14]  Thomas C. Rindflesch,et al.  MedPost: a part-of-speech tagger for bioMedical text , 2004, Bioinform..

[15]  Hongyuan Zha,et al.  Global ranking by exploiting user clicks , 2009, SIGIR.

[16]  Feifan Liu,et al.  Unsupervised Approaches for Automatic Keyword Extraction Using Meeting Transcripts , 2009, NAACL.

[17]  Vassilis Plachouras,et al.  Online learning from click data for sponsored search , 2008, WWW.

[18]  Anette Hulth,et al.  Improved Automatic Keyword Extraction Given More Linguistic Knowledge , 2003, EMNLP.

[19]  Gerard Salton,et al.  Term-Weighting Approaches in Automatic Text Retrieval , 1988, Inf. Process. Manag..

[20]  C. Federiuk The effect of abbreviations on MEDLINE searching. , 1999, Academic emergency medicine : official journal of the Society for Academic Emergency Medicine.

[21]  Sunghwan Sohn,et al.  Abbreviation definition identification based on automatic precision estimates , 2008, BMC Bioinformatics.

[22]  Carl J. Schmidt,et al.  Mining the Biomedical Literature for Genic Information , 2008, BioNLP.

[23]  Shamkant B. Navathe,et al.  Text Mining Functional Keywords Associated with Genes , 2004, MedInfo.

[24]  Mitsuru Ishizuka,et al.  Keyword extraction from a single document using word co-occurrence statistical information , 2004, Int. J. Artif. Intell. Tools.

[25]  Christopher D. Manning,et al.  Introduction to Information Retrieval , 2010, J. Assoc. Inf. Sci. Technol..

[26]  Zhiyong Lu,et al.  Viewpoint Paper: Evaluating Relevance Ranking Strategies for MEDLINE Retrieval , 2009, J. Am. Medical Informatics Assoc..

[27]  Qiang Yang,et al.  Mining Web Query Hierarchies from Clickthrough Data , 2007, AAAI.

[28]  Jia Zeng,et al.  Enhancing MEDLINE document clustering by incorporating MeSH semantic similarity , 2009, Bioinform..

[29]  Linda A. Watson,et al.  Information Retrieval: A Health and Biomedical Perspective. , 2005 .

[30]  Xin Jiang,et al.  A ranking approach to keyphrase extraction , 2009, SIGIR.

[31]  Sophia Ananiadou,et al.  FACTA: a text search engine for finding associated biomedical concepts , 2008, Bioinform..

[32]  W. John Wilbur,et al.  How to interpret PubMed queries and why it matters , 2009, J. Assoc. Inf. Sci. Technol..