Evaluating human versus machine learning performance in classifying research abstracts

We study whether humans or machine learning (ML) classification models are better at classifying scientific research abstracts according to a fixed set of discipline groups. We recruit both undergraduate and postgraduate assistants for this task in separate stages, and compare their performance against the support vector machine (SVM) ML algorithm at classifying European Research Council Starting Grant project abstracts into their actual evaluation panels, which are organised by discipline groups. On average, ML is more accurate than human classifiers, across a variety of training and test datasets, and across evaluation panels. ML classifiers trained on different training sets are also more reliable than human classifiers: different ML classifiers are more consistent in assigning the same classification to any given abstract than different human classifiers are. While the top five percent of human classifiers can outperform ML in limited cases, selecting and training such classifiers is likely costly and difficult compared to training ML models. Our results suggest that ML models are a cost-effective and highly accurate method for addressing problems in comparative bibliometric analysis, such as harmonising the discipline classifications of research from different funding agencies or countries.
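
To make the classification setup concrete, the sketch below shows one plausible way to train and evaluate such a model. It is a minimal illustration assuming scikit-learn, pairing TF-IDF text features with a linear SVM; the abstract texts, panel labels, and train/test split are hypothetical placeholders, not the paper's actual ERC data or preprocessing pipeline.

```python
# Minimal sketch: TF-IDF features feeding a linear SVM text classifier.
# All texts and panel labels below are hypothetical toy examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

# Hypothetical training data: abstract texts and their evaluation-panel labels.
train_texts = [
    "We study protein folding dynamics with molecular simulations.",
    "This project examines labour market effects of migration policy.",
]
train_labels = ["LS", "SH"]  # e.g. Life Sciences, Social Sciences & Humanities

# Hypothetical held-out abstract with its true panel label.
test_texts = ["This project uses molecular simulations to study protein dynamics."]
test_labels = ["LS"]

# Build the vocabulary and fit the SVM on the training abstracts only.
model = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
model.fit(train_texts, train_labels)

# Classify unseen abstracts and score against the true panels.
predictions = model.predict(test_texts)
print("Accuracy:", accuracy_score(test_labels, predictions))
```

In the paper's setting, accuracy would be measured against the panels that actually evaluated each grant, and the reliability of multiple classifiers (human or ML) on the same abstracts could be summarised with an agreement statistic such as Fleiss' kappa.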
