Funding map using paragraph embedding based on semantic diversity

Maps of science, which represent the structure of science, can help us understand science and technology (S&T) development. Prior studies have therefore developed techniques for analyzing the relationships among research activities; however, inter-citation and co-citation analysis are difficult to apply to ongoing research projects and recently published papers. To characterize what is currently being attempted in the scientific landscape, this paper proposes a new content-based method that locates research projects in a multi-dimensional space using recent word and paragraph embedding techniques. Specifically, to address the problem that the original paragraph vectors yield an unclustered map, we introduce paragraph vectors based on the information entropies of concepts in an S&T thesaurus. The experimental results show that the proposed method successfully formed a clustered map from 25,607 project descriptions of the EU's 7th Framework Programme from 2006 to 2016 and 34,192 project descriptions of the National Science Foundation from 2012 to 2016.
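The abstract does not spell out how the information entropy of a thesaurus concept is computed, so the following is only an illustrative sketch of one plausible reading: the Shannon entropy of the distribution of words that co-occur with a concept across project descriptions. The function names (`shannon_entropy`, `concept_entropies`), the document-level co-occurrence window, and the toy data are all assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Shannon entropy (in bits) of a discrete distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def concept_entropies(documents, concepts):
    """Entropy of each concept's co-occurring-word distribution.

    Assumption: two terms co-occur when they appear in the same document.
    """
    concept_set = set(concepts)
    context = {c: Counter() for c in concepts}
    for tokens in documents:
        for c in set(tokens) & concept_set:
            # Count every other token in the document as context for c.
            context[c].update(t for t in tokens if t != c)
    return {c: shannon_entropy(list(ctr.values())) for c, ctr in context.items()}


# Hypothetical toy corpus: each document is a tokenized project description.
docs = [
    ["energy", "solar", "cell"],
    ["energy", "wind", "turbine"],
    ["energy", "policy", "market"],
    ["perovskite", "solar", "cell"],
    ["perovskite", "solar", "cell"],
]
entropies = concept_entropies(docs, ["energy", "perovskite"])
```

In this toy corpus the broad concept "energy" appears in varied contexts and receives a higher entropy than the narrow concept "perovskite", which always appears with the same words. A measure of this kind could, under these assumptions, be used to weight or select concepts when building paragraph vectors.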
