scriptLattes: an open-source knowledge extraction system from the Lattes platform

The Lattes platform is the major scientific information system maintained by the National Council for Scientific and Technological Development (CNPq). It manages the curricular information of researchers and institutions working in Brazil, based on the so-called Lattes Curriculum. However, the public information is available only individually for each researcher, and the platform does not automatically generate reports that aggregate the scientific production of a research group. It is thus difficult to extract and summarize useful knowledge for medium to large groups of researchers. This paper describes the design, implementation, and experiences with scriptLattes, an open-source system that creates academic reports for groups based on curricula from the Lattes Database. The scriptLattes system is composed of the following modules: (a) data selection, (b) data preprocessing, (c) redundancy treatment, (d) collaboration graph generation among group members, (e) research map generation based on geographical information, and (f) automatic report creation covering bibliographical, technical, and artistic production, as well as academic supervisions. The system has been extensively tested on a wide variety of research groups from Brazilian institutions, and the generated reports have proven to be a practical alternative for extracting knowledge from data in the context of the Lattes platform. The source code, usage instructions, and examples are available at http://scriptlattes.sourceforge.net/.
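
To make modules (c) and (d) more concrete, the following is a minimal sketch of how redundancy treatment and collaboration graph generation could fit together. It is not the scriptLattes implementation: the function names, the input format (a dictionary mapping each member to a list of publication titles), the title-similarity criterion, and the 0.9 threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations
import re


def normalize(title):
    """Lowercase, collapse whitespace, and drop punctuation (hypothetical helper)."""
    collapsed = " ".join(title.lower().split())
    return re.sub(r"[^\w\s]", "", collapsed)


def same_publication(a, b, threshold=0.9):
    """Assumed criterion: two titles denote one publication if they are near-identical."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def merge_publications(curricula, threshold=0.9):
    """Redundancy treatment: merge entries listed in several members' curricula.

    curricula: dict mapping member name -> list of publication titles.
    Returns a list of (title, set_of_members) pairs, one per distinct publication.
    """
    merged = []
    for member, titles in curricula.items():
        for title in titles:
            for known_title, members in merged:
                if same_publication(known_title, title, threshold):
                    members.add(member)  # same publication seen in another curriculum
                    break
            else:
                merged.append((title, {member}))
    return merged


def collaboration_graph(merged):
    """Collaboration graph: edge weight = number of publications two members share."""
    edges = defaultdict(int)
    for _, members in merged:
        for a, b in combinations(sorted(members), 2):
            edges[(a, b)] += 1
    return dict(edges)


# Hypothetical group with one publication listed twice under slightly different forms.
curricula = {
    "Ana": ["Graph mining in Lattes.", "Another paper"],
    "Bruno": ["Graph Mining in Lattes", "Solo work"],
    "Carla": ["Solo study"],
}
merged = merge_publications(curricula)
graph = collaboration_graph(merged)
```

In this sketch, a publication that appears in two members' curricula is counted once in the group report and contributes one unit of weight to the edge between those members. A real system would likely compare more fields than the title (for example year and venue) before declaring two entries redundant.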
