Analysis of Mexican Research Production - Exploring a Scientific Database

This paper presents an exploratory analysis of a country's research activity using the ISI Web of Science collection. We focus on Mexican research in computer science. The aim of this text mining work is to extract the main directions of research in this field, with clustering as the central line of exploration. The analysis has two parts: the first examines the frequency representation of the extracted terms; the second, much larger and more difficult, mines the document representations to find clusters of documents, using the most frequent terms in the titles. The clustering algorithms applied were hierarchical, k-means, DIANA, SOM, SOTA, PAM, AGNES, and model-based clustering. Experiments were carried out with different numbers of terms and with the complete dataset, but the results were not satisfactory. We conclude that model-based clustering is the best approach for this type of analysis because it gives a better classification, although it still needs better-performing algorithms. The results show that very few areas are developed by Mexican researchers.
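
As a rough illustration of the second part of the analysis (representing each document by the most frequent title terms and then clustering those representations), the following minimal sketch uses plain Python and k-means, one of the algorithms listed above. It is not the authors' pipeline: the stopword list, the function names, the deterministic seeding, and the four example titles are all assumptions made for illustration only.

```python
import re
from collections import Counter

# Hypothetical, very small stopword list; a real analysis would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "for", "in", "on", "and", "with", "using", "to"}


def tokenize(title):
    """Lowercase a title and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", title.lower()) if t not in STOPWORDS]


def top_terms(titles, n_terms):
    """Return the n_terms most frequent terms across all titles."""
    counts = Counter(tok for title in titles for tok in tokenize(title))
    # Sort by descending frequency, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n_terms]]


def vectorize(titles, vocabulary):
    """Represent each title as a term-count vector over the vocabulary."""
    return [[tokenize(title).count(term) for term in vocabulary] for title in titles]


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, k, n_iter=20):
    """Plain k-means, seeded with the first k distinct vectors (an assumption)."""
    centroids = []
    for v in vectors:
        if v not in centroids:
            centroids.append(list(v))
        if len(centroids) == k:
            break
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: each document goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Made-up titles, used only to exercise the sketch.
titles = [
    "Fuzzy control of mobile robots",
    "Neural network control for robots",
    "Text mining of scientific documents",
    "Clustering scientific documents with text mining",
]
vocabulary = top_terms(titles, n_terms=6)
labels = kmeans(vectorize(titles, vocabulary), k=2)
print(vocabulary)
print(labels)  # [0, 0, 1, 1]
```

Varying `n_terms` corresponds loosely to the experiments with different numbers of terms; on real title data the clusters would be far less clean than in this toy example.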
