Machine learning misclassification of academic publications reveals non-trivial interdependencies of scientific disciplines

Exploring the production of knowledge with quantitative methods is the foundation of scientometrics. In an application of machine learning to scientometrics, we consider the problem of classifying academic publications into the subcategories of a multidisciplinary journal, and hence into scientific disciplines, based on the information contained in the abstract. In contrast to standard classification tasks, we are not interested in maximizing accuracy; rather, we ask whether the failures of an automatic classification are systematic and contain information about the system under investigation. These failures can be represented as a 'misclassification network' inter-relating scientific disciplines. Here we show that this misclassification network (1) gives a markedly different pattern of interdependencies among scientific disciplines than common 'maps of science', (2) reveals a statistical association between misclassification and citation frequencies, and (3) allows disciplines to be classified as 'method lenders' and 'content explorers', based on their in-degree/out-degree asymmetry. More generally, in a wide range of machine learning applications, misclassification networks have the potential to extract systemic information from failed classifications, making it possible to visualize and quantitatively assess those aspects of a complex system that are not machine learnable.
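
As a rough illustration of the idea, a misclassification network can be sketched as a weighted directed graph built from the classifier's errors. The sketch below is not the authors' implementation: the function names, the toy labels, and the edge direction (from the true category to the predicted category) are all assumptions made for this example, and the abstract does not say which sign of the asymmetry corresponds to 'method lenders' or 'content explorers'.

```python
from collections import Counter


def misclassification_network(true_labels, predicted_labels):
    """Count one directed edge (true -> predicted) per misclassified item.

    Assumption: edges point from the true category to the predicted one.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    edges = Counter()
    for true, pred in zip(true_labels, predicted_labels):
        if true != pred:  # correct classifications add no edge
            edges[(true, pred)] += 1
    return edges


def degree_asymmetry(edges):
    """Return weighted in-degree minus weighted out-degree for each node."""
    in_deg = Counter()
    out_deg = Counter()
    for (src, dst), weight in edges.items():
        out_deg[src] += weight
        in_deg[dst] += weight
    nodes = set(in_deg) | set(out_deg)
    return {node: in_deg[node] - out_deg[node] for node in nodes}


# Hypothetical toy data: six abstracts from three made-up categories.
true_labels = ["physics", "physics", "biology", "biology", "math", "math"]
predicted_labels = ["math", "physics", "math", "biology", "math", "physics"]

edges = misclassification_network(true_labels, predicted_labels)
asymmetry = degree_asymmetry(edges)

for node, value in sorted(asymmetry.items()):
    print(node, value)
```

In this toy case, a positive value marks a category that attracts more misclassified abstracts than it loses, and a negative value marks the reverse. With real data, the label lists would come from held-out predictions of whatever classifier is used.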
