Co-clustering approaches to integrate lexical and bibliographical information

Terms are the building blocks for organizing and accessing information, and they hold a key position in information retrieval. In forthcoming work we have shown how a methodology of indexing full-text scientific articles, combined with an exploratory statistical analysis, can improve on bibliometric approaches to mapping science. Textual documents are indexed and further characterized using data mining techniques and co-word analysis. We start this paper by briefly demonstrating the text mining approach. Whereas statistical processing of full-text documents provides a relational view based on the topicality of these documents, bibliometric components can capture other characteristics that describe their position in the set. We therefore extend previous work and explore how hybrid methodologies that closely combine text analysis and bibliometric methods can improve the mapping of science and technology. In particular, we propose a method to mathematically combine document similarity matrices resulting from vector-based indices on the one hand, and from selected bibliometric indicators on the other. Weighted linear combinations as well as approaches inspired by statistical meta-analysis are presented. Both pitfalls and possible solutions are discussed. The resulting combined similarity matrix offers an attractive way to ‘co-cluster’ documents based on both lexical and bibliographic information.

Introduction

Bibliometric methods have proved to be valuable tools for monitoring and charting scientific processes. When publications are treated as atomic entities in scientometric studies, the relationships between elements of a given set of scientific publications can readily be described and analyzed using bibliometric tools. However, lexical information may also convey important clues for such mapping purposes. Using both sources of information in a complementary way therefore offers interesting perspectives.
The idea of combining bibliometric methods with the analysis of indexing terms, subject headings or keywords extracted from titles and/or abstracts is not new (Callon et al., 1991; Noyons and van Raan, 1994; Zitt and Bassecoulard, 1994; Kostoff et al., 2001). In a forthcoming publication (Glenisson et al., 2005), we examine how full-text analysis, which casts the terms of a scientific publication in a vector space, can complement more traditional bibliometric indicators for the purpose of science mapping. In this paper we present and extend the main conclusions of the previous work and propose an integrated approach to jointly mine lexical and bibliometric information. Our goal is to improve on both existing lexical and bibliometric approaches to science (or technology) mapping through a hybrid methodology that enables a ‘best of both worlds’ approach.
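To make the idea of combining a lexical and a bibliometric document similarity matrix concrete, the following minimal sketch forms a weighted linear combination of two such matrices. It is an illustration under stated assumptions, not the procedure used in this paper: cosine similarity, the use of shared cited references as the bibliometric indicator, the toy data, the weight of 0.5 and the function names `cosine_similarity_matrix` and `combine_similarities` are all hypothetical choices made for the example.

```python
import numpy as np


def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X (assumed measure)."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # avoid division by zero for all-zero rows
    unit = X / norms
    return unit @ unit.T


def combine_similarities(s_text, s_bib, weight=0.5):
    """Weighted linear combination: weight * s_text + (1 - weight) * s_bib."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    s_text = np.asarray(s_text, dtype=float)
    s_bib = np.asarray(s_bib, dtype=float)
    if s_text.shape != s_bib.shape:
        raise ValueError("similarity matrices must have the same shape")
    return weight * s_text + (1.0 - weight) * s_bib


# Toy data, invented for illustration: three documents.
# Rows are documents, columns are term weights.
term_vectors = np.array([[1.0, 1.0, 0.0],
                         [1.0, 0.0, 0.0],
                         [0.0, 0.0, 1.0]])
# Rows are documents, columns are cited references (1 = cited).
reference_vectors = np.array([[1.0, 0.0],
                              [0.0, 1.0],
                              [0.0, 1.0]])

s_text = cosine_similarity_matrix(term_vectors)      # lexical similarity
s_bib = cosine_similarity_matrix(reference_vectors)  # bibliometric similarity
s_combined = combine_similarities(s_text, s_bib, weight=0.5)
print(np.round(s_combined, 3))
```

In this toy example, documents 2 and 3 share no terms but cite the same reference, so the combined matrix assigns them a nonzero similarity that the lexical matrix alone would miss. The combined matrix can then be passed to any clustering routine that accepts a precomputed similarity (or, after conversion, distance) matrix.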