A novel machine-learning approach to measuring scientific knowledge flows using citation context analysis

Abstract: We measure knowledge flows between countries by analysing publication and citation data, arguing that not all citations are equally important. In contrast to existing techniques that use absolute citation counts to quantify knowledge flows between entities, our model applies citation context analysis, using machine learning to distinguish important from non-important citations. We use 14 novel features (including context-based, cue-word-based and text-based features) to train a Support Vector Machine (SVM) and a Random Forest classifier on an annotated dataset of 20,527 publications downloaded from the Association for Computational Linguistics anthology (http://allenai.org/data.html). Our models outperform existing state-of-the-art citation context approaches, with the SVM reaching up to 61% and the Random Forest up to 90% Precision–Recall Area Under the Curve under 10-fold cross-validation. Finally, we present a case study that applies the method to PLoS ONE full-text publications in the field of Computer and Information Sciences. Our results show that a significant volume of the knowledge flow from the United States, based on important citations, is consumed by the international scientific community. Of the total knowledge flow from China, a relatively small proportion (only 4.11%) is based on important citations, while The Netherlands and Germany show the highest proportions, at 9.06% and 7.35% respectively. Among institutions, more than 10% of the knowledge produced at the University of Malaya falls into the important category. We believe that such analyses help to understand the dynamics of knowledge flows across nations and institutions.
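The per-country proportions reported above (for example, 4.11% for China) can be read as the share of a country's outgoing knowledge flow that rests on citations classified as important. The following is a minimal sketch of that arithmetic, not the authors' pipeline: the record layout, the function name `important_flow_share`, the toy data, and the choice to skip domestic citations are all assumptions made for illustration.

```python
from collections import defaultdict

# Toy records: (cited_country, citing_country, is_important).
# All values are invented for illustration; they are not from the study.
CITATIONS = [
    ("US", "DE", True),
    ("US", "CN", False),
    ("US", "NL", False),
    ("US", "GB", True),
    ("CN", "US", False),
    ("CN", "DE", False),
    ("CN", "NL", False),
    ("CN", "GB", True),
    ("NL", "NL", True),  # domestic citation, skipped below (an assumption)
]


def important_flow_share(citations):
    """Return {source country: percent of its outgoing cross-border
    citations that are labelled important}."""
    total = defaultdict(int)
    important = defaultdict(int)
    for cited, citing, is_important in citations:
        if cited == citing:
            # Assumption: only cross-border citations count as a flow.
            continue
        total[cited] += 1
        if is_important:
            important[cited] += 1
    return {country: 100.0 * important[country] / total[country]
            for country in total}


shares = important_flow_share(CITATIONS)
for country, pct in sorted(shares.items()):
    print(f"{country}: {pct:.2f}% of outgoing flow is based on important citations")
```

In practice the `is_important` flag would come from the trained classifier's prediction for each citation context rather than from hand-entered values.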
