Evaluating document filtering systems over time

We propose a new way of measuring document filtering system performance over time. Performance is calculated per batch, a trend line is fitted to the results, and systems are compared by their estimated performance at the end of the evaluation period. Important new insights emerge from re-evaluating the TREC KBA CCR runs of 2012 and 2013.

Document filtering is a popular task in information retrieval: a stream of documents arriving over time is filtered for documents relevant to a set of topics. The distinguishing feature of document filtering is the temporal aspect introduced by the stream of documents. Up to now, document filtering systems have been evaluated in terms of traditional metrics such as (micro- or macro-averaged) precision, recall, MAP, nDCG, F1, and utility. We argue that these metrics do not capture all relevant aspects of the systems being evaluated; in particular, they lack support for the temporal dimension of the task. We propose a time-sensitive way of measuring the performance of document filtering systems over time by employing trend estimation. In short, performance is calculated per batch, a trend line is fitted to the results, and the estimated performance of systems at the end of the evaluation period is used to compare systems. We detail the application of our proposed trend estimation framework and examine the assumptions that need to hold for valid significance testing. Additionally, we analyze the requirements a document filtering metric has to meet and show that traditional macro-averaged true-positive-based metrics, such as precision, recall, and utility, fail to capture essential information when applied in a batch setting. In particular, false positives returned in a batch for topics that are absent from that batch's ground truth go unnoticed. This is a serious flaw, as a system's over-generation may be overlooked. We propose a new metric, aptness, that does capture false positives, incorporate it into an overall score, and show that this new score meets all requirements. To demonstrate our proposed evaluation methodology, we re-evaluate the runs submitted to the Cumulative Citation Recommendation task of the 2012 and 2013 editions of the TREC Knowledge Base Acceleration track, two recent editions of a document filtering evaluation campaign, and show that important new insights emerge.
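To make the batch-then-trend procedure concrete, here is a minimal sketch. All specifics are illustrative assumptions rather than the paper's exact configuration: the per-batch metric (micro-averaged F1 over hypothetical per-batch counts), the use of ordinary least squares for the trend line, and the toy systems `system_a` and `system_b` are chosen purely for exposition.

```python
"""Minimal sketch of time-sensitive evaluation via trend estimation.

Assumptions (not from the paper): per-batch micro-averaged F1 as the
metric, OLS for the trend line, and hypothetical per-batch counts.
"""
import numpy as np


def per_batch_scores(batches):
    """Compute a per-batch effectiveness score (here: micro-averaged F1).

    `batches` is a list of (true_positives, false_positives,
    false_negatives) count tuples, one per time batch.
    """
    scores = []
    for tp, fp, fn in batches:
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append(f1)
    return np.array(scores)


def end_of_period_estimate(scores):
    """Fit an OLS trend line to the per-batch scores and return the
    fitted value at the final batch, on which systems are compared."""
    t = np.arange(len(scores))
    slope, intercept = np.polyfit(t, scores, deg=1)
    return slope * t[-1] + intercept


# Hypothetical systems: A starts strong but degrades, B improves over time.
system_a = [(8, 2, 2), (7, 3, 3), (6, 4, 4), (5, 5, 5)]
system_b = [(5, 5, 5), (6, 4, 4), (7, 3, 3), (8, 2, 2)]

for name, batches in [("A", system_a), ("B", system_b)]:
    est = end_of_period_estimate(per_batch_scores(batches))
    print(f"system {name}: estimated end-of-period F1 = {est:.3f}")
```

Note that the two hypothetical systems have the same mean per-batch F1 (0.65), so averaging over the whole period would tie them; the fitted end-of-period estimates (0.5 vs. 0.8) instead favor the system whose performance is improving, which is exactly the temporal information the proposed evaluation is designed to surface.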
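The blind spot of macro-averaged, true-positive-based metrics in a batch setting can likewise be illustrated with a small sketch. The data and the `macro_precision` helper below are hypothetical, and the aptness metric itself is not reproduced here; the example only demonstrates the failure mode described above.

```python
"""Illustration of the false-positive blind spot in macro-averaged,
true-positive-based batch metrics (hypothetical data; the paper's
aptness metric is not reproduced here)."""


def macro_precision(run, qrels):
    """Macro-averaged precision over the topics that have relevant
    documents in this batch's ground truth -- the conventional approach."""
    vals = []
    for topic, relevant in qrels.items():
        returned = run.get(topic, set())
        vals.append(len(returned & relevant) / len(returned)
                    if returned else 0.0)
    return sum(vals) / len(vals) if vals else 0.0


# One batch: only topic "t1" has relevant documents in the ground truth.
qrels = {"t1": {"d1", "d2"}}

# A precise system, and one that also over-generates on an absent topic.
run_precise = {"t1": {"d1", "d2"}}
run_overgen = {"t1": {"d1", "d2"}, "t2": {"d9", "d10", "d11"}}

# Both score 1.0: the false positives for "t2" never enter the average,
# because "t2" is absent from this batch's ground truth.
print(macro_precision(run_precise, qrels))  # 1.0
print(macro_precision(run_overgen, qrels))  # 1.0
```

Because topic `t2` has no relevant documents in this batch's ground truth, it never enters the macro average, and the over-generating system receives the same perfect score as the precise one.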
