Measuring the interestingness of articles in a limited user environment

Search engines such as Google assign scores to news articles based on their relevance to a query. However, not every article relevant to the query is interesting to a user. For example, an article that is old or yields little new information would be uninteresting. Relevance scores do not take into account what makes an article interesting, which varies from user to user. Although methods such as collaborative filtering have been shown to be effective in recommendation systems, a limited user environment does not have enough users to make collaborative filtering effective. A general framework, called iScore, is presented for defining and measuring the "interestingness" of articles, incorporating user feedback. iScore addresses the various aspects of what makes an article interesting, such as topic relevance, uniqueness, freshness, source reputation, and writing style. It employs various methods, such as multiple topic tracking, online parameter selection, language models, clustering, sentiment analysis, and phrase extraction, to measure these features. Because users differ in their reasons for finding an article interesting, an online feature selection method in naïve Bayes is also used to improve recommendation results. iScore can outperform traditional IR techniques by as much as 50.7%. iScore and its components are evaluated in the news recommendation task using three datasets from Yahoo! News, actual users, and Digg.
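
As an illustration of how per-feature scores might be combined with user feedback, the following is a minimal sketch of an incrementally trained Gaussian naïve Bayes classifier. It is not the iScore implementation: the class name `OnlineGaussianNB`, the three example features (topic relevance, uniqueness, freshness), the score values, and the variance floor are all assumptions made for this example, and the online feature selection step mentioned above is omitted.

```python
import math


class OnlineGaussianNB:
    """Gaussian naive Bayes updated one labeled article at a time.

    Hypothetical helper for illustration only; not taken from iScore.
    """

    def __init__(self, n_features, var_floor=1e-3):
        self.n_features = n_features
        self.var_floor = var_floor  # assumed floor to avoid zero variance
        self.count = {True: 0, False: 0}
        self.mean = {c: [0.0] * n_features for c in (True, False)}
        self.m2 = {c: [0.0] * n_features for c in (True, False)}

    def update(self, scores, interesting):
        """Fold one user-feedback example into the running statistics."""
        self.count[interesting] += 1
        n = self.count[interesting]
        for i, x in enumerate(scores):
            # Welford's running mean / sum-of-squared-deviations update.
            delta = x - self.mean[interesting][i]
            self.mean[interesting][i] += delta / n
            self.m2[interesting][i] += delta * (x - self.mean[interesting][i])

    def _log_joint(self, scores, label):
        n = self.count[label]
        total = self.count[True] + self.count[False]
        log_p = math.log((n + 1) / (total + 2))  # Laplace-smoothed class prior
        for i, x in enumerate(scores):
            var = max(self.m2[label][i] / n, self.var_floor) if n > 0 else 1.0
            mu = self.mean[label][i]
            log_p += -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
        return log_p

    def prob_interesting(self, scores):
        """Posterior probability that an article is interesting."""
        lp = self._log_joint(scores, True)
        ln = self._log_joint(scores, False)
        m = max(lp, ln)  # subtract the max for numerical stability
        ep, en = math.exp(lp - m), math.exp(ln - m)
        return ep / (ep + en)


# Made-up feature scores: [topic relevance, uniqueness, freshness].
model = OnlineGaussianNB(n_features=3)
for scores in ([0.9, 0.8, 0.7], [0.8, 0.9, 0.9], [0.7, 0.7, 0.8]):
    model.update(scores, interesting=True)
for scores in ([0.2, 0.1, 0.3], [0.1, 0.3, 0.2], [0.3, 0.2, 0.1]):
    model.update(scores, interesting=False)

print(model.prob_interesting([0.85, 0.8, 0.8]))
print(model.prob_interesting([0.15, 0.2, 0.2]))
```

Because the statistics are updated one example at a time, each new piece of user feedback can be absorbed without retraining from scratch, which is the property an online setting needs.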
