Exploring topics in the field of data science by analyzing Wikipedia documents: A preliminary result

In this poster, we explore topics in the field of Data Science using Wikipedia documents, applying hierarchical clustering, principal component analysis (PCA), and topic modeling. As a pilot study, we analyzed a subset of the Wikipedia document dataset to obtain an initial picture of the topics discussed in Data Science. Hierarchical clustering yielded six clusters of topics, while PCA identified eleven dimensions in the Data Science field. In addition, topic modeling based on latent Dirichlet allocation (LDA) produced fifty topics related to Data Science. We plan to further examine the hierarchical and structural relationships between topics using structural equation modeling and social network analysis. The findings will help clarify which topics are currently discussed in the area of Data Science.
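To make the hierarchical clustering step concrete, the following is a minimal sketch in plain Python. It is not the poster's implementation: the abstract does not state the features, distance measure, or linkage used, so raw term frequencies, cosine distance, and average linkage are assumptions chosen for illustration. The six toy "documents" and the helper names (`tf_vector`, `cosine_distance`, `average_linkage`, `agglomerate`) are hypothetical stand-ins, not the Wikipedia data or the study's code.

```python
from collections import Counter
from itertools import combinations
import math

# Toy stand-ins for Wikipedia article text (hypothetical, not the study's data).
DOCS = {
    "Machine learning": "model training data algorithm prediction model",
    "Deep learning": "neural network model training data layers",
    "Data visualization": "chart plot graph visual dashboard",
    "Infographic": "visual chart graph design plot",
    "Database": "query table storage index sql",
    "Data warehouse": "storage table query sql schema",
}


def tf_vector(text):
    """Return a raw term-frequency vector (assumed feature representation)."""
    return Counter(text.split())


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def average_linkage(c1, c2, vectors):
    """Mean pairwise distance between the members of two clusters."""
    dists = [cosine_distance(vectors[x], vectors[y]) for x in c1 for y in c2]
    return sum(dists) / len(dists)


def agglomerate(vectors, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[name] for name in vectors]
    while len(clusters) > n_clusters:
        # Pick the pair of cluster indices (i < j) with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], vectors),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


vectors = {name: tf_vector(text) for name, text in DOCS.items()}
clusters = agglomerate(vectors, n_clusters=3)
for members in clusters:
    print(sorted(members))
```

On this toy input the procedure groups the two learning articles, the two visualization articles, and the two storage articles. The number of clusters is passed in by hand here; how the six clusters reported above were selected is not described in the abstract.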
