Research literature clustering using diffusion maps

We apply the knowledge discovery process to map current topics in a particular field of science. We are interested in how articles form clusters and what the resulting clusters contain. We present a framework that combines web scraping, keyword extraction, dimensionality reduction, and clustering using the diffusion map algorithm. The input is publicly available information about articles in high-impact journals. The method should be useful to practitioners and scientists who want an overview of recent research in a field. As a case study, we map the topics of the data mining literature published in 2011.
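The dimensionality reduction and clustering steps of such a framework can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each article has already been reduced to a set of keywords, uses the Jaccard distance between keyword sets, a Gaussian kernel with a hand-picked bandwidth `epsilon`, and a sign split on the first nontrivial diffusion coordinate as a stand-in for the clustering step. The function names, the toy keyword sets, and all parameter values are hypothetical.

```python
import numpy as np


def jaccard_distance_matrix(keyword_sets):
    """Pairwise Jaccard distances between sets of keywords."""
    n = len(keyword_sets)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            union = keyword_sets[i] | keyword_sets[j]
            inter = keyword_sets[i] & keyword_sets[j]
            d = 1.0 - len(inter) / len(union) if union else 0.0
            dist[i, j] = dist[j, i] = d
    return dist


def diffusion_map(dist, epsilon, n_components=2, t=1):
    """Return (eigenvalues, diffusion coordinates) for a distance matrix.

    epsilon is the kernel bandwidth and t the diffusion time; both are
    illustrative choices, not values taken from the paper.
    """
    # Gaussian kernel turns distances into affinities.
    kernel = np.exp(-dist ** 2 / epsilon)
    degree = kernel.sum(axis=1)
    # Symmetric matrix with the same eigenvalues as the random-walk
    # transition matrix P = D^{-1} K.
    sym = kernel / np.sqrt(np.outer(degree, degree))
    eigvals, eigvecs = np.linalg.eigh(sym)
    order = np.argsort(eigvals)[::-1]  # sort in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Convert to right eigenvectors of P.
    psi = eigvecs / np.sqrt(degree)[:, None]
    # Skip the trivial constant eigenvector (eigenvalue 1).
    coords = psi[:, 1:n_components + 1] * eigvals[1:n_components + 1] ** t
    return eigvals, coords


# Toy input (hypothetical): six articles described by keyword sets.
articles = [
    {"clustering", "k-means", "unsupervised"},
    {"clustering", "k-means", "spectral"},
    {"clustering", "unsupervised", "spectral"},
    {"classification", "svm", "kernel"},
    {"classification", "svm", "supervised"},
    {"classification", "kernel", "supervised"},
]

dist = jaccard_distance_matrix(articles)
eigvals, coords = diffusion_map(dist, epsilon=0.5)

# Simplest possible clustering: split on the sign of the first
# nontrivial diffusion coordinate.
labels = (coords[:, 0] > 0).astype(int)
```

In this toy input the first three articles share clustering-related keywords and the last three share classification-related ones, so the sign split puts each group in its own cluster. With real data, a standard clustering algorithm applied to the first few diffusion coordinates would replace the sign split.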
