Keeping keywords fresh: a BM25 variation for personalized keyword extraction

Keyword extraction from web pages is essential to various text mining tasks, including contextual advertising, recommendation selection, user profiling and personalization. For example, in contextual advertising the extracted keywords are used to match advertisements with the web page a user is currently browsing. Most keyword extraction methods rely mainly on the content of a single web page and ignore the user's browsing history, which can lead to the same advertisements or recommendations being shown repeatedly. In this work we propose a new feature scoring algorithm for web page term extraction that, assuming a recent browsing history is available per user, takes into account the freshness of keywords in the current page as a means of reflecting shifts in the user's interests. We propose BM25H, a variant of the BM25 scoring function, implemented on the client side, that takes the user's browsing history into account and suggests keywords that are relevant to the currently browsed page but also fresh with respect to the user's recent browsing history. In this way, for each web page we obtain a set of keywords representing the user's time-shifting interests. BM25H avoids repeating keywords that may simply be domain-specific stop-words or that may result in matching the same ads or similar recommendations. Our experimental results show that BM25H achieves more than 70% precision at 20 extracted keywords (based on blind human evaluation) and outperforms our baselines (the TF and BM25 scoring functions), while succeeding in keeping extracted keywords fresh relative to the user's recent history.
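The abstract does not give the BM25H formula, so the following is only a rough illustration of the general idea of history-aware scoring, not the authors' method. It applies standard BM25 term weighting to the current page and, as an assumption made here, computes document frequencies over the user's recent history pages plus the current page, so that terms appearing throughout the history receive a low weight. The function names, the tokenizer, and the parameter defaults `k1=1.2` and `b=0.75` are hypothetical choices for this sketch.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_terms(current_page, history_pages, k1=1.2, b=0.75):
    """Return BM25-style weights for the terms of the current page.

    Assumption of this sketch (not taken from the paper): the "corpus"
    used for document frequencies is the recent browsing history plus
    the current page, so terms seen on many history pages score low.
    """
    current = tokenize(current_page)
    history = [tokenize(page) for page in history_pages]

    n_docs = len(history) + 1
    total_len = sum(len(doc) for doc in history) + len(current)
    avg_len = (total_len / n_docs) or 1.0  # guard against empty input

    # Document frequency: number of pages that contain each term.
    df = Counter()
    for doc in history:
        df.update(set(doc))
    df.update(set(current))

    tf = Counter(current)
    scores = {}
    for term, freq in tf.items():
        # Non-negative IDF variant, log(1 + (N - df + 0.5) / (df + 0.5)).
        idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
        # Saturating term frequency with document length normalization.
        length_norm = 1 - b + b * len(current) / avg_len
        scores[term] = idf * freq * (k1 + 1) / (freq + k1 * length_norm)
    return scores


def top_keywords(current_page, history_pages, n=20):
    """Return the n highest-scoring terms, ties broken alphabetically."""
    scores = score_terms(current_page, history_pages)
    return sorted(scores, key=lambda t: (-scores[t], t))[:n]
```

With a history dominated by football pages, a term such as "football" on the current page would rank below terms the user has not recently seen, which mirrors the abstract's goal of keeping keywords fresh. A real client-side implementation would also need stop-word handling and a policy for how much history to keep, neither of which is specified in the abstract.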

Michalis Vazirgiannis | Vassilis Plachouras | Margarita Karkali | Constantinos Stefanatos
