Towards a Reliable and Robust Methodology for Crowd-Based Subjective Quality Assessment of Query-Based Extractive Text Summarization

Intrinsic and extrinsic quality evaluation is an essential part of the summary evaluation methodology and is usually conducted in a traditional controlled laboratory environment. However, processing large text corpora with these methods proves expensive from both an organizational and a financial perspective. For the first time, and as a fast, scalable, and cost-effective alternative, we propose micro-task crowdsourcing to evaluate both the intrinsic and extrinsic quality of query-based extractive text summaries. To investigate the appropriateness of crowdsourcing for this task, we conduct extensive comparative crowdsourcing and laboratory experiments, evaluating nine extrinsic and intrinsic quality measures on 5-point MOS scales. Correlating crowd and laboratory ratings reveals high applicability of crowdsourcing for the factors overall quality, grammaticality, non-redundancy, referential clarity, focus, structure & coherence, summary usefulness, and summary informativeness. Further, we investigate how the number of repeated assessments affects the robustness of the mean opinion score of crowd ratings, measured by the increase in correlation coefficients between crowd and laboratory ratings. Our results suggest that the optimal number of repetitions in crowdsourcing setups, beyond which additional repetitions no longer yield an adequate increase in the overall correlation coefficients, lies between seven and nine for both intrinsic and extrinsic quality factors.
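
To illustrate the repetition analysis described above, the following minimal sketch (not the authors' implementation) builds crowd MOS values from the first k repetitions per summary, correlates them with laboratory MOS values, and stops once the marginal correlation gain falls below a threshold. All data, the noise model, the gain threshold, and the stopping rule are illustrative assumptions.

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

n_items = 40     # number of summaries, each rated on a 5-point MOS scale
max_reps = 15    # crowd repetitions collected per summary

# Hypothetical laboratory MOS per summary (reference values for the comparison).
lab_mos = rng.uniform(1.0, 5.0, size=n_items)

# Hypothetical crowd ratings: noisy, discretized versions of the lab scores.
crowd = np.clip(
    np.rint(lab_mos[:, None] + rng.normal(0.0, 1.0, size=(n_items, max_reps))),
    1, 5,
)

# Spearman correlation between lab MOS and the crowd MOS computed from the
# first k repetitions, for every k.
correlations = []
for k in range(1, max_reps + 1):
    crowd_mos_k = crowd[:, :k].mean(axis=1)
    rho, _ = spearmanr(lab_mos, crowd_mos_k)
    correlations.append(rho)

# Assumed stopping rule: the smallest k after which one additional repetition
# improves the correlation by less than an (illustrative) threshold.
gain_threshold = 0.005
optimal_k = max_reps
for k in range(1, max_reps):          # gain of going from k to k+1 repetitions
    if correlations[k] - correlations[k - 1] < gain_threshold:
        optimal_k = k
        break

for k, rho in enumerate(correlations, start=1):
    print(f"repetitions={k:2d}  Spearman rho={rho:.3f}")
print(f"Estimated optimal number of repetitions: {optimal_k}")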
