Towards the Automated Evaluation of Crowd Work: Machine-Learning Based Classification of Complex Texts Simplified by Laymen

The work paradigm of crowdsourcing holds huge potential for organizations by giving them access to a large workforce. However, as the volume of crowd work grows, so does the effort required to evaluate the quality of submissions. Because evaluations by experts are inefficient, time-consuming, expensive, and not guaranteed to be effective, this paper presents a concept for an automated classification process for crowd work. Using the example of crowd-generated patent transcripts, we build on interdisciplinary research to present an approach for classifying them along two dimensions: correctness and readability. To this end, we identify and select text attributes from different disciplines as input for machine-learning classification algorithms and evaluate the suitability of three well-regarded algorithms: neural networks, support vector machines (SVMs), and k-nearest neighbor. The key findings are that the proposed classification approach is feasible and that the SVM classifier performs best in our experiment.
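As a rough illustration of the general idea (text attributes feeding a classifier), the following minimal sketch derives three surface features from a text and classifies it with a plain k-nearest-neighbor vote. It covers only the readability dimension. The feature set, the function names `text_features` and `knn_predict`, the labels, and the toy training sentences are all assumptions made for this example; the abstract does not state which attributes or tools the authors used. The Coleman-Liau index stands in as one well-known readability formula.

```python
import math
import re
from collections import Counter


def text_features(text):
    """Return (words per sentence, letters per word, Coleman-Liau index)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    letters = sum(len(w) for w in words)
    n_words = max(len(words), 1)          # guard against empty input
    n_sentences = max(len(sentences), 1)
    # Coleman-Liau: L = letters per 100 words, S = sentences per 100 words.
    L = 100.0 * letters / n_words
    S = 100.0 * n_sentences / n_words
    cli = 0.0588 * L - 0.296 * S - 15.8
    return (n_words / n_sentences, letters / n_words, cli)


def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors closest to `query`.

    Features are used unscaled here, which is a simplification.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# HYPOTHETICAL toy data, invented for this sketch (not from the paper).
TOY_TEXTS = [
    ("The cat sat on the mat. It was warm.", "readable"),
    ("We put the box on the shelf. Then we went home.", "readable"),
    ("The dog ran to the park. He saw a bird.", "readable"),
    ("The apparatus comprises a plurality of interconnected modules "
     "configured to facilitate bidirectional transmission of "
     "electromagnetic signals.", "hard"),
    ("Said embodiment incorporates thermally conductive substrates "
     "positioned adjacent to semiconductor components, thereby "
     "dissipating accumulated energy.", "hard"),
    ("A method characterized by sequentially administering pharmaceutical "
     "compositions containing therapeutically effective concentrations "
     "of synthesized compounds.", "hard"),
]
TRAIN = [(text_features(text), label) for text, label in TOY_TEXTS]

if __name__ == "__main__":
    sample = "The sun is hot. We sit in the shade."
    print(knn_predict(TRAIN, text_features(sample)))
```

In a realistic setting the features would be normalized, the labels would come from expert ratings, and the classifier would be compared against alternatives such as an SVM or a neural network on held-out data.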
