Cheap IR evaluation

To evaluate Information Retrieval (IR) effectiveness, a common approach is to use test collections, which consist of a collection of documents, a set of descriptions of information needs (called topics), and a set of relevance judgments identifying the relevant documents for each topic. Test collections are typically built in a competition scenario: for example, in the well-known TREC initiative, participants run their own retrieval systems over a set of topics and provide a ranked list of retrieved documents; some of the retrieved documents (usually the top-ranked ones) form the so-called pool, and their relevance is assessed by human judges; the resulting judgments are then used to compute effectiveness metrics and rank the participating systems. Private web search companies also run their own in-house evaluation exercises; although the details are mostly unknown and the aims are somewhat different, the overall approach shares several issues with the test collection approach. The aim of this work is to: (i) develop and improve some state-of-the-art work on the evaluation of IR effectiveness while saving resources, and (ii) propose a novel, more principled and engineered, overall approach to test collection based effectiveness evaluation. [...]
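As a minimal illustration of the workflow described above, the following Python sketch builds a pool from the top-ranked documents of several runs, assumes a set of assessor judgments over that pool, computes Average Precision for each run, and ranks the systems. The run names, document IDs, pool depth, and judgments are hypothetical examples introduced here for illustration only; they are not taken from any actual TREC edition.

```python
# Minimal sketch of test-collection-based evaluation for a single topic.
# Runs, pool depth, and relevance judgments below are hypothetical.

POOL_DEPTH = 3  # real pooling exercises typically use depths of 100 or more

# Ranked lists of document IDs returned by three hypothetical systems.
runs = {
    "sysA": ["d1", "d3", "d7", "d2", "d9"],
    "sysB": ["d3", "d4", "d1", "d8", "d5"],
    "sysC": ["d6", "d2", "d3", "d1", "d4"],
}

# 1. Pooling: the union of the top-ranked documents of every run.
pool = set()
for ranking in runs.values():
    pool.update(ranking[:POOL_DEPTH])

# 2. Human assessment: assessors judge only the pooled documents.
#    Here we simply assume which pooled documents were judged relevant.
relevant = {"d1", "d3", "d4"} & pool

def average_precision(ranking, relevant_docs):
    """Average Precision: mean of the precision values at the ranks of relevant documents."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant_docs:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_docs) if relevant_docs else 0.0

# 3. Effectiveness metric per run, then system ranking.
scores = {name: average_precision(ranking, relevant) for name, ranking in runs.items()}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: AP = {score:.3f}")
```

Note that documents outside the pool are simply treated as non-relevant; the cost of building deep pools and judging many topics is precisely the resource issue that this work aims to reduce.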

[1]  Maliha S. Nash,et al.  Handbook of Parametric and Nonparametric Statistical Procedures , 2001, Technometrics.

[2]  Pu Li,et al.  Test theory for assessing IR test collections , 2007, SIGIR.

[3]  Stephen E. Robertson,et al.  A new rank correlation coefficient for information retrieval , 2008, SIGIR '08.

[4]  Tague-SutcliffeJean The pragmatics of information retrieval experimentation, revisited , 1992 .

[5]  Ben Carterette,et al.  Hypothesis testing with incomplete relevance judgments , 2007, CIKM '07.

[6]  Donna K. Harman,et al.  Overview of the First Text REtrieval Conference (TREC-1) , 1992, TREC.

[7]  Mónica Marrero,et al.  On the measurement of test collection reliability , 2013, SIGIR.

[8]  Stefano Mizzaro,et al.  Reproduce and Improve , 2018, ACM J. Data Inf. Qual..

[9]  J. Shane Culpepper,et al.  The effect of pooling and evaluation depth on IR metrics , 2016, Information Retrieval Journal.

[10]  Ellen M. Voorhees,et al.  The effect of topic set size on retrieval experiment error , 2002, SIGIR '02.

[11]  Ben Carterette,et al.  On rank correlation and the distance between rankings , 2009, SIGIR.

[12]  G. Gescheider Psychophysics: The Fundamentals , 1997 .

[13]  Alistair Moffat,et al.  Rank-biased precision for measurement of retrieval effectiveness , 2008, TOIS.

[14]  Djoerd Hiemstra,et al.  Relying on topic subsets for system ranking estimation , 2009, CIKM.

[15]  Marco Basaldella,et al.  Crowdsourcing Relevance Assessments: The Unexpected Benefits of Limiting the Time to Judge , 2016, HCOMP.

[16]  Tetsuya Sakai,et al.  Alternatives to Bpref , 2007, SIGIR.

[17]  Kevin Roitero CHEERS: CHeap & Engineered Evaluation of Retrieval Systems , 2018, SIGIR.

[18]  Amanda Spink,et al.  From Highly Relevant to Not Relevant: Examining Different Regions of Relevance , 1998, Inf. Process. Manag..

[19]  Emine Yilmaz,et al.  Representative & Informative Query Selection for Learning to Rank using Submodular Functions , 2015, SIGIR.

[20]  Peter Bailey,et al.  UQV100: A Test Collection with Query Variability , 2016, SIGIR.

[21]  James E. Bartlett,et al.  Organizational research: Determining appropriate sample size in survey research , 2001 .

[22]  Lei Han,et al.  All Those Wasted Hours: On Task Abandonment in Crowdsourcing , 2019, WSDM.

[23]  Stefano Mizzaro,et al.  Economic Evaluation of Recommender Systems: A Proposal , 2017, IIR.

[24]  Donna K. Harman,et al.  Overview of the Reliable Information Access Workshop , 2009, Information Retrieval.

[25]  Alistair Moffat,et al.  A similarity measure for indefinite rankings , 2010, TOIS.

[26]  Matthew Lease,et al.  Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments , 2016, HCOMP.

[27]  Eero Sormunen,et al.  Liberal relevance criteria of TREC -: counting on negligible documents? , 2002, SIGIR '02.

[28]  O. J. Dunn Multiple Comparisons among Means , 1961 .

[29]  Oren Kurland,et al.  Query Performance Prediction Using Reference Lists , 2016, ACM Trans. Inf. Syst..

[30]  Eddy Maddalena,et al.  Crowd Worker Strategies in Relevance Judgment Tasks , 2020, WSDM.

[31]  Ellen M. Voorhees,et al.  TREC 2014 Web Track Overview , 2015, TREC.

[32]  James Allan,et al.  Comparing In Situ and Multidimensional Relevance Judgments , 2017, SIGIR.

[33]  Jean Tague-Sutcliffe,et al.  The Pragmatics of Information Retrieval Experimentation Revisited , 1997, Inf. Process. Manag..

[34]  Stefano Mizzaro,et al.  Improving the Efficiency of Retrieval Effectiveness Evaluation: Finding a Few Good Topics with Clustering? , 2016, IIR.

[35]  Alistair Moffat,et al.  Models and metrics: IR evaluation as a user process , 2012, ADCS.

[36]  Anselm Spoerri,et al.  Using the structure of overlap between search results to rank retrieval systems without relevance judgments , 2007, Inf. Process. Manag..

[37]  Peter Ingwersen,et al.  Dimensions of relevance , 2000, Inf. Process. Manag..

[38]  Ellen M. Voorhees,et al.  Evaluating Evaluation Measure Stability , 2000, SIGIR 2000.

[39]  Hinrich Schütze,et al.  Introduction to information retrieval , 2008 .

[40]  Falk Scholer,et al.  On Crowdsourcing Relevance Magnitudes for Information Retrieval Evaluation , 2017, ACM Trans. Inf. Syst..

[41]  Ingemar J. Cox,et al.  On Aggregating Labels from Multiple Crowd Workers to Infer Relevance of Documents , 2012, ECIR.

[42]  Karen Spärck Jones A statistical interpretation of term specificity and its application in retrieval , 2021, J. Documentation.

[43]  Eddy Maddalena,et al.  Considering Assessor Agreement in IR Evaluation , 2017, ICTIR.

[44]  Stefano Mizzaro,et al.  Bias and Fairness in Effectiveness Evaluation by Means of Network Analysis and Mixture Models , 2019, IIR.

[45]  Eddy Maddalena,et al.  On Fine-Grained Relevance Scales , 2018, SIGIR.

[46]  Mark Sanderson,et al.  Problems with Kendall's tau , 2007, SIGIR.

[47]  Allan Hanbury,et al.  Assessors Agreement: A Case Study Across Assessor Type, Payment Levels, Query Variations and Relevance Dimensions , 2016, CLEF.

[48]  David J. Sheskin,et al.  Handbook of Parametric and Nonparametric Statistical Procedures , 1997 .

[49]  Klaus Krippendorff,et al.  Computing Krippendorff's Alpha-Reliability , 2011 .

[50]  Ingemar J. Cox,et al.  Selecting a Subset of Queries for Acquisition of Further Relevance Judgements , 2011, ICTIR.

[51]  Stefano Mizzaro,et al.  Towards Stochastic Simulations of Relevance Profiles , 2019, CIKM.

[52]  Donna K. Harman,et al.  Overview of the Eighth Text REtrieval Conference (TREC-8) , 1999, TREC.

[53]  Eddy Maddalena,et al.  IRevalOO: An Object Oriented Framework for Retrieval Evaluation , 2018, SIGIR.

[54]  Stephen E. Robertson,et al.  Hits hits TREC: exploring IR evaluation results with network analysis , 2007, SIGIR.

[55]  Ingemar J. Cox,et al.  Prioritizing relevance judgments to improve the construction of IR test collections , 2011, CIKM '11.

[56]  Allan Hanbury,et al.  The Impact of Fixed-Cost Pooling Strategies on Test Collection Bias , 2016, ICTIR.

[57]  Gabriella Kazai INitiative for the Evaluation of XML Retrieval , 2009, Encyclopedia of Database Systems.

[58]  Rabia Nuray-Turan,et al.  Automatic ranking of retrieval systems in imperfect environments , 2003, SIGIR '03.

[59]  Tetsuya Sakai,et al.  On the reliability of information retrieval metrics based on graded relevance , 2007, Inf. Process. Manag..

[60]  Jaana Kekäläinen,et al.  Cumulated gain-based evaluation of IR techniques , 2002, TOIS.

[61]  Rong Tang,et al.  Towards the Identification of the Optimal Number of Relevance Categories , 1999, J. Am. Soc. Inf. Sci..

[62]  J. Aslam,et al.  A Practical Sampling Strategy for Efficient Retrieval Evaluation , 2007 .

[63]  Norbert Fuhr,et al.  Some Common Mistakes In IR Evaluation, And How They Can Be Avoided , 2018, SIGIR Forum.

[64]  Eddy Maddalena,et al.  On Transforming Relevance Scales , 2019, CIKM.

[65]  Tie-Yan Liu,et al.  Learning to rank for information retrieval , 2009, SIGIR.

[66]  Hayato Yamana,et al.  Overview of the NTCIR-5 WEB Navigational Retrieval Subtask 2 (Navi-2) , 2005, NTCIR.

[67]  Emine Yilmaz,et al.  Estimating average precision with incomplete and imperfect judgments , 2006, CIKM '06.

[68]  Pengfei Li,et al.  On the Effectiveness of Query Weighting for Adapting Rank Learners to New Unlabelled Collections , 2016, CIKM.

[69]  Neha Gupta,et al.  Modus Operandi of Crowd Workers , 2017, Proc. ACM Interact. Mob. Wearable Ubiquitous Technol..

[70]  Ahmed Abbasi,et al.  Benchmarking Twitter Sentiment Analysis Tools , 2014, LREC.

[71]  Fernando Diaz,et al.  Vertical selection in the presence of unlabeled verticals , 2010, SIGIR '10.

[72]  Stephen E. Robertson,et al.  On GMAP: and other transformations , 2006, CIKM '06.

[73]  C. J. van Rijsbergen,et al.  Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW , 2001 .

[74]  William Yang Wang “Liar, Liar Pants on Fire”: A New Benchmark Dataset for Fake News Detection , 2017, ACL.

[75]  J. Shane Culpepper,et al.  Fewer topics? A million topics? Both?! On topics subsets in test collections , 2020, Inf. Retr. J..

[76]  Julián Urbano,et al.  Stochastic Simulation of Test Collections: Evaluation Scores , 2018, SIGIR.

[77]  Chris Buckley,et al.  Topic prediction based on comparative retrieval rankings , 2004, SIGIR '04.

[78]  J. Shane Culpepper,et al.  On Topic Difficulty in IR Evaluation: The Effect of Systems, Corpora, and System Components , 2019, SIGIR.

[79]  Ben Carterette,et al.  Multiple testing in statistical analysis of systems-based information retrieval experiments , 2012, TOIS.

[80]  Stephen E. Robertson,et al.  A few good topics: Experiments in topic set reduction for retrieval evaluation , 2009, TOIS.

[81]  Daniele Fanelli,et al.  Negative results are disappearing from most disciplines and countries , 2011, Scientometrics.

[82]  Elad Yom-Tov,et al.  Estimating the query difficulty for information retrieval , 2010, Synthesis Lectures on Information Concepts, Retrieval, and Services.

[83]  A. E. Hoerl,et al.  Ridge regression: biased estimation for nonorthogonal problems , 2000 .

[84]  Tetsuya Sakai,et al.  Ranking Retrieval Systems without Relevance Assessments: Revisited , 2010, EVIA@NTCIR.

[85]  Cyril Cleverdon,et al.  The Cranfield tests on index language devices , 1997 .

[86]  Eddy Maddalena,et al.  The Impact of Task Abandonment in Crowdsourcing , 2019, IEEE Transactions on Knowledge and Data Engineering.

[87]  R. Tibshirani Regression Shrinkage and Selection via the Lasso , 1996 .

[88]  Ben Carterette,et al.  Preference based evaluation measures for novelty and diversity , 2013, SIGIR.

[89]  Tamer Elsayed,et al.  Intelligent topic selection for low-cost information retrieval evaluation: A New perspective on deep vs. shallow judging , 2017, Inf. Process. Manag..

[90]  Javed A. Aslam,et al.  On the effectiveness of evaluating retrieval systems in the absence of relevance judgments , 2003, SIGIR.

[91]  Mark Sanderson,et al.  Information retrieval system evaluation: effort, sensitivity, and reliability , 2005, SIGIR '05.

[92]  SaracevicTefko Relevance: A review of the literature and a framework for thinking on the notion in information science. Part III: Behavior and effects of relevance , 2007 .

[93]  Guido Zuccon,et al.  Overview of the CLEF 2018 Consumer Health Search Task , 2018, CLEF.

[94]  Ellen M. Voorhees,et al.  TREC: Experiment and Evaluation in Information Retrieval (Digital Libraries and Electronic Publishing) , 2005 .

[95]  Allan Hanbury,et al.  MM: A new Framework for Multidimensional Evaluation of Search Engines , 2018, CIKM.

[96]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques, 3rd Edition , 1999 .

[97]  Stefano Mizzaro,et al.  How Many Truth Levels? Six? One Hundred? Even More? Validating Truthfulness of Statements via Crowdsourcing , 2018, CIKM Workshops.

[98]  W. Bruce Croft,et al.  Search Engines - Information Retrieval in Practice , 2009 .

[99]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[100]  A. E. Eiben,et al.  Introduction to Evolutionary Computing 2nd Edition , 2020 .

[101]  Julián Urbano,et al.  Test collection reliability: a study of bias and robustness to statistical assumptions via stochastic simulation , 2016, Information Retrieval Journal.

[102]  Mounia Lalmas,et al.  Report on the INEX 2003 workshop , 2004, SIGF.

[103]  Alistair Moffat,et al.  Statistical power in retrieval experimentation , 2008, CIKM '08.

[104]  Tim Berners-Lee,et al.  Information Management: A Proposal , 1990 .

[105]  Oren Kurland,et al.  Predicting Query Performance by Query-Drift Estimation , 2009, TOIS.

[106]  Anand Rajaraman,et al.  Mining of Massive Datasets , 2011 .

[107]  Eddy Maddalena,et al.  Let's Agree to Disagree: Fixing Agreement Measures for Crowdsourcing , 2017, HCOMP.

[108]  Laurence A. Marschall,et al.  Null and Void , 1999 .

[109]  Jon Kleinberg,et al.  Authoritative sources in a hyperlinked environment , 1999, SODA '98.

[110]  Omar Alonso,et al.  Using crowdsourcing for TREC relevance assessment , 2012, Inf. Process. Manag..

[111]  Stefano Mizzaro,et al.  Reproduce. Generalize. Extend. On Information Retrieval Evaluation without Relevance Judgments , 2018, ACM J. Data Inf. Qual..

[112]  J. Knight Negative results: Null and void , 2003, Nature.

[113]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[114]  Stefano Mizzaro,et al.  HITS Hits Readersourcing: Validating Peer Review Alternatives Using Network Analysis , 2019, BIRNDL@SIGIR.

[115]  Guido Zuccon,et al.  Understandability Biased Evaluation for Information Retrieval , 2016, ECIR.

[116]  Josiane Mothe,et al.  Human-Based Query Difficulty Prediction , 2017, ECIR.

[117]  J. Shane Culpepper,et al.  Improving test collection pools with machine learning , 2014, ADCS.

[118]  Olivier Chapelle,et al.  Expected reciprocal rank for graded relevance , 2009, CIKM.

[119]  Stefano Mizzaro,et al.  IR Evaluation without a Common Set of Topics , 2009, ICTIR.

[120]  A. P. Dawid,et al.  Maximum Likelihood Estimation of Observer Error‐Rates Using the EM Algorithm , 1979 .

[121]  James Allan,et al.  If I Had a Million Queries , 2009, ECIR.

[122]  Stefano Mizzaro,et al.  Effectiveness Evaluation with a Subset of Topics: A Practical Approach , 2018, SIGIR.

[123]  Ben Carterette,et al.  Million Query Track 2007 Overview , 2008, TREC.

[124]  Anselm Spoerri,et al.  How the overlap between the search results of different retrieval systems correlates with document relevance , 2006, ASIST.

[125]  Gerard Salton,et al.  The SMART Information Retrieval System after 30 years - Panel. , 1991, SIGIR 1991.

[126]  Hsin-Hsi Chen,et al.  Overview of CLIR Task at the Fourth NTCIR Workshop , 2004, NTCIR.

[127]  José Luis Vicedo González,et al.  TREC: Experiment and evaluation in information retrieval , 2007, J. Assoc. Inf. Sci. Technol..

[128]  Emine Yilmaz,et al.  Document selection methodologies for efficient and effective learning-to-rank , 2009, SIGIR.

[129]  Ivor W. Tsang,et al.  Domain Adaptation via Transfer Component Analysis , 2009, IEEE Transactions on Neural Networks.

[130]  T. Saracevic,et al.  Relevance: A review of the literature and a framework for thinking on the notion in information science. Part II: nature and manifestations of relevance , 2007, J. Assoc. Inf. Sci. Technol..

[131]  Rabia Nuray-Turan,et al.  Automatic ranking of information retrieval systems using data fusion , 2006, Inf. Process. Manag..

[132]  Falk Scholer,et al.  Effective Pre-retrieval Query Performance Prediction Using Similarity and Variability Evidence , 2008, ECIR.

[133]  Jong-Hak Lee,et al.  Analyses of multiple evidence combination , 1997, SIGIR '97.

[134]  Charles L. A. Clarke,et al.  The TREC 2006 Terabyte Track , 2006, TREC.

[135]  Maarten de Rijke,et al.  Balancing Relevance Criteria through Multi-Objective Optimization , 2016, SIGIR.

[136]  Shengli Wu,et al.  Methods for ranking information retrieval systems without relevance judgments , 2003, SAC '03.

[137]  David Zhang,et al.  Learning Domain-Invariant Subspace Using Domain Features and Independence Maximization , 2016, IEEE Transactions on Cybernetics.

[138]  David Maxwell Chickering,et al.  Here or There , 2008, ECIR.

[139]  Carsten Eickhoff,et al.  Cognitive Biases in Crowdsourcing , 2018, WSDM.

[140]  Peter Willett,et al.  Document Retrieval Systems , 1988 .

[141]  Handbook of Parametric and Nonparametric Statistical Procedures , 2004 .

[142]  Stephen E. Robertson,et al.  On Using Fewer Topics in Information Retrieval Evaluations , 2013, ICTIR.

[143]  Falk Scholer,et al.  The Benefits of Magnitude Estimation Relevance Assessments for Information Retrieval Evaluation , 2015, SIGIR.

[144]  Noriko Kando,et al.  Increasing Reproducibility in IR: Findings from the Dagstuhl Seminar on "Reproducibility of Data-Oriented Experiments in e-Science" , 2016, SIGIR Forum.

[145]  Daniel E. Rose,et al.  Understanding user goals in web search , 2004, WWW '04.

[146]  David E. Losada,et al.  Multi-armed bandits for adjudicating documents in pooling-based evaluation of information retrieval systems , 2017, Inf. Process. Manag..

[147]  Peter Bailey,et al.  Tasks, Queries, and Rankers in Pre-Retrieval Performance Prediction , 2017, ADCS.

[148]  Donna K. Harman,et al.  The NRRC reliable information access (RIA) workshop , 2004, SIGIR '04.

[149]  Ben Carterette,et al.  Low-cost and robust evaluation of information retrieval systems , 2008, SIGF.

[150]  Tetsuya Sakai,et al.  Statistical Significance, Power, and Sample Sizes: A Systematic Review of SIGIR and TOIS, 2006-2015 , 2016, SIGIR.

[151]  Ellen M. Voorhees,et al.  Overview of the TREC 2004 Robust Retrieval Track , 2004 .

[152]  Djoerd Hiemstra,et al.  A survey of pre-retrieval query performance predictors , 2008, CIKM '08.

[153]  Vannevar Bush,et al.  As we may think , 1945, INTR.

[154]  Tetsuya Sakai,et al.  Designing Test Collections for Comparing Many Systems , 2014, CIKM.

[155]  Stephen E. Robertson,et al.  On the Contributions of Topics to System Evaluation , 2011, ECIR.

[156]  Jakob Grue Simonsen,et al.  Evaluation Measures for Relevance and Credibility in Ranked Lists , 2017, ICTIR.

[157]  G. Casella,et al.  The Bayesian Lasso , 2008 .

[158]  F. Massey The Kolmogorov-Smirnov Test for Goodness of Fit , 1951 .

[159]  P. Fishburn Condorcet Social Choice Functions , 1977 .

[160]  Franciska de Jong,et al.  Retrieval system evaluation: automatic evaluation versus incomplete judgments , 2010, SIGIR '10.

[161]  Eddy Maddalena,et al.  Do Easy Topics Predict Effectiveness Better Than Difficult Topics? , 2017, ECIR.

[162]  Philip J. Corriveau,et al.  Study of Rating Scales for Subjective Quality Assessment of High-Definition Video , 2011, IEEE Transactions on Broadcasting.

[163]  Ellen M. Voorhees,et al.  Overview of the TREC 2004 Robust Track. , 2004 .

[164]  Justin Zobel,et al.  How reliable are the results of large-scale information retrieval experiments? , 1998, SIGIR '98.

[165]  Falk Scholer,et al.  The effect of threshold priming and need for cognition on relevance calibration and assessment , 2013, SIGIR.

[166]  Shengli Wu,et al.  Data fusion with estimated weights , 2002, CIKM '02.

[167]  Josiane Mothe,et al.  Linguistic features to predict query difficulty , 2005, SIGIR 2005.

[168]  Josiane Mothe,et al.  Query Performance Prediction and Effectiveness Evaluation Without Relevance Judgments: Two Sides of the Same Coin , 2018, SIGIR.

[169]  Josiane Mothe,et al.  Why do you Think this Query is Difficult?: A User Study on Human Query Prediction , 2016, SIGIR.

[170]  James Allan,et al.  Minimal test collections for retrieval evaluation , 2006, SIGIR.

[171]  Tie-Yan Liu,et al.  Learning to Rank for Information Retrieval , 2011 .

[172]  Kalyanmoy Deb,et al.  A fast and elitist multiobjective genetic algorithm: NSGA-II , 2002, IEEE Trans. Evol. Comput..

[173]  Stefano Mizzaro,et al.  How many relevances in information retrieval? , 1998, Interact. Comput..

[174]  James Allan,et al.  Evaluation over thousands of queries , 2008, SIGIR '08.

[175]  Charles L. A. Clarke,et al.  Overview of the TREC 2004 Terabyte Track , 2004, TREC.

[176]  Stefano Mizzaro,et al.  Effectiveness evaluation without human relevance judgments: A systematic analysis of existing methods and of their combinations , 2020, Inf. Process. Manag..

[177]  R. Feise Do multiple outcome measures require p-value adjustment? , 2002, BMC medical research methodology.

[178]  Shariq Bashir Combining pre-retrieval query quality predictors using genetic programming , 2013, Applied Intelligence.

[179]  Oren Kurland,et al.  Query-performance prediction: setting the expectations straight , 2014, SIGIR.

[180]  Ron Kohavi,et al.  A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection , 1995, IJCAI.

[181]  Mounia Lalmas,et al.  Overview of INEX 2004 , 2004, INEX.

[182]  Hans Peter Luhn,et al.  A Statistical Approach to Mechanized Encoding and Searching of Literary Information , 1957, IBM J. Res. Dev..

[183]  Fernando Diaz,et al.  Performance prediction using spatial autocorrelation , 2007, SIGIR.

[184]  Emine Yilmaz,et al.  A simple and efficient sampling method for estimating AP and NDCG , 2008, SIGIR '08.

[185]  Nicola Ferro,et al.  Reproducibility Challenges in Information Retrieval Evaluation , 2017, ACM J. Data Inf. Qual..

[186]  Milad Shokouhi,et al.  An uncertainty-aware query selection model for evaluation of IR systems , 2012, SIGIR '12.

[187]  Ian Soboroff,et al.  Ranking retrieval systems without relevance judgments , 2001, SIGIR '01.

[188]  and software — performance evaluation , .

[189]  Peter Emerson,et al.  The original Borda count and partial voting , 2013, Soc. Choice Welf..

[190]  Djoerd Hiemstra,et al.  A Case for Automatic System Evaluation , 2010, ECIR.

[191]  Tetsuya Sakai,et al.  Topic set size design , 2015, Information Retrieval Journal.

[192]  Milad Shokouhi,et al.  Community-based bayesian aggregation models for crowdsourcing , 2014, WWW.

[193]  João Francisco Valiati,et al.  Document-level sentiment classification: An empirical comparison between SVM and ANN , 2013, Expert Syst. Appl..