Clustering articles based on semantic similarity

Document clustering is generally the first step in topic identification. Since many clustering methods operate on the similarities between documents, it is important to build document representations that preserve as much of the documents' semantics as possible and that also support efficient similarity calculation. As we describe in Koopman et al. (Proceedings of ISSI 2015 Istanbul: 15th International Society of Scientometrics and Informetrics Conference, Istanbul, Turkey, 29 June to 3 July, 2015. Bogaziçi University Printhouse. http://www.issi2015.org/files/downloads/all-papers/1042.pdf, 2015), the metadata of the articles in the Astro dataset contribute to a semantic matrix, which uses a vector space to capture the semantics of entities derived from these articles and thereby supports the contextual exploration of these entities in LittleAriadne. However, this semantic matrix does not allow similarities between articles to be calculated directly. In this paper, we describe in detail how we build a semantic representation of an article from the entities associated with it. Based on these semantic representations of articles, we apply two standard clustering methods, K-Means and the Louvain community detection algorithm, which lead to our two clustering solutions, labelled OCLC-31 (K-Means) and OCLC-Louvain (Louvain). We also give the implementation details and a basic comparison with the other clustering solutions reported in this special issue.
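To make the idea concrete, the following is a minimal sketch of deriving article vectors from entity vectors and clustering them. It is not the implementation behind OCLC-31 or OCLC-Louvain: the toy entity vectors, the choice of averaging entity vectors to obtain an article vector, and the helper names `article_vectors` and `kmeans` are all assumptions made for illustration. The K-Means step is a plain Lloyd iteration written with NumPy, and the Louvain step is omitted.

```python
import numpy as np

# Hypothetical entity vectors standing in for rows of a semantic matrix.
entity_vecs = {
    "galaxy":       np.array([1.0, 0.0, 0.0]),
    "redshift":     np.array([0.9, 0.1, 0.0]),
    "pulsar":       np.array([0.0, 1.0, 0.0]),
    "neutron star": np.array([0.0, 0.9, 0.1]),
}

# Hypothetical articles, each described by the entities associated with it.
article_entities = [
    ["galaxy", "redshift"],
    ["galaxy"],
    ["pulsar", "neutron star"],
    ["pulsar"],
]

def article_vectors(entity_vecs, article_entities):
    """Assumed aggregation: average the entity vectors, then L2-normalise."""
    rows = []
    for entities in article_entities:
        v = np.mean([entity_vecs[e] for e in entities], axis=0)
        rows.append(v / np.linalg.norm(v))
    return np.vstack(rows)

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd-style K-Means; returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    # Start from k distinct articles chosen at random (fancy indexing copies).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every article to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels

X = article_vectors(entity_vecs, article_entities)
similarity = X @ X.T          # cosine similarity, since rows are unit length
labels = kmeans(X, k=2)
print(np.round(similarity, 2))
print(labels)
```

Because the article vectors are normalised to unit length, the dot product of two rows is their cosine similarity, so a single matrix product yields all pairwise article similarities. A similarity matrix of this kind could also serve as the weighted graph that a community detection method such as Louvain operates on.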
