Comparative study between first and all-author co-citation analysis based on citation indexes generated from XML data

The study presents a comparative analysis of first-author and all-author co-citation analyses, as well as a comparison of two matrix generation approaches. We thus continue the latest research in author co-citation analysis (ACA), where the results of traditional first-author analyses based on ISI citation indexes are challenged by incorporating all authors of the cited references. Identifying all cited authors from references in source papers is an extremely cumbersome process if the Thomson ISI citation indexes are used as a basis. Because all-author co-citation data are difficult to obtain, few such studies exist. In order to study all-author co-citation we use a citation index generated from documents in XML code. This allows us to carry out a comparative study of first-author and all-author co-citation analyses based on the hitherto largest set of references and the broadest domain of research.

Introduction

Author co-citation analysis (ACA), introduced by White and Griffith (1981), is a technique for mapping the ‘intellectual structure’ of a research field, where the latter is defined as a coherent literature set. The intellectual structure is mapped from the oeuvres of the most cited and co-cited first authors in a particular literature set. Since its introduction, ACA has become a popular and much used technique. Recently, however, a debate concerning methodical procedures in ACA has emerged. In particular, the approach to ACA developed at Drexel University (e.g., White & Griffith, 1981; McCain, 1990) has been the focus of the current debate.
Essentially, four methodical issues have been debated: 1) scalability (e.g., Chen, 1999), 2) units of analysis and their definition (e.g., Persson, 2001; Zhao, 2006; Rousseau & Zuccala, 2004), 3) the choice of proximity measures (e.g., Ahlgren, Jarneving, & Rousseau, 2003; Schneider & Borlund, 2007a; 2007b), and most recently 4) generation and transformation of matrices (Leydesdorff & Vaughan, 2006; Schneider & Borlund, 2007a). The present paper addresses the second and fourth issues in a comparative study of first-author and all-author co-citation analysis based on different matrix generation approaches, using structured XML documents that allow for the construction of ad hoc citation indexes.

The paper is structured as follows. The following section briefly discusses previous research on all-author co-citation analyses and matrix generation. The subsequent section describes the research method of the study, i.e., data collection and data analysis. The next section presents and discusses the results, and the contribution ends with a conclusion.

Previous Work on All-author Co-citations and Matrix Generation

In several respects, the methodical approach to ACA developed at Drexel University has been shaped by specific technical features that have seemingly constrained the ACA methodology. Most important are the dependence upon the standardized cited reference strings in Thomson ISI’s citation indexes, and the use of the SPSS statistical package as the tool for multivariate analyses. The most obvious example is that the cited reference strings only allow for first authors as units of analysis in ACA. As a result, ACA methodology only takes into account first authors in the definition of author co-citation counts.
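For contrast, a reference list marked up in XML keeps every cited author as a separate element, so first-author and all-author lists can both be read from the same record. The sketch below is illustrative only: the element names ref and author, the sample record, and the function cited_authors are assumptions made for this example, not the markup or tooling used in the study.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: <ref> and <author> are illustrative element names,
# not the schema of the collection used in the study.
SAMPLE = """
<article id="d1">
  <ref><author>White, H.D.</author><author>Griffith, B.C.</author></ref>
  <ref><author>McCain, K.W.</author></ref>
</article>
"""


def cited_authors(xml_text):
    """Return one list of author names per cited reference, in byline order."""
    root = ET.fromstring(xml_text.strip())
    return [
        [a.text.strip() for a in ref.findall("author") if a.text]
        for ref in root.iter("ref")
    ]


refs = cited_authors(SAMPLE)
# First-author view: only the leading name of each reference survives.
first_authors = [authors[0] for authors in refs if authors]
# All-author view: every name in every reference is kept.
all_authors = [name for authors in refs for name in authors]
```

In this toy record the first-author view drops the second author of the first reference, which is the information a first-author-only cited reference string cannot supply.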
Two authors are considered to be co-cited when at least one document from each author’s oeuvre occurs in the same reference list of a citing document, where an author’s oeuvre is defined as all the works with the author as the first author (McCain, 1990). This definition has rarely been challenged. Persson (2001) is the first empirical study that compares the potential difference in intellectual structure between mappings done by first-author and all-author co-citation analyses. The study is based on 7001 source documents from library and information science journals in the CD-ROM version of the Social Sciences Citation Index 1986-1996. The study uses multidimensional scaling (MDS) to investigate how these source documents have been co-cited with each other within the dataset. The co-citations for source documents amount to some 7% of the total number of references in the dataset; the remaining 93% go to non-source documents not indexed by the Thomson ISI citation indexes. The study demonstrates that first-author ACA leaves out several influential researchers compared to all-author ACA, although the subfield structure tends to be largely the same for both methods. The study is somewhat limited by its dependence on a limited set of source documents, the sparse details provided concerning the definition and calculation of co-citations, and the informal evaluation procedures. Nevertheless, the results are indicative, as they are partly confirmed in a smaller study by Zhao (2006).

All-author vs. First-author Co-citation Analyses

Zhao (2006) is the hitherto most detailed theoretical and empirical investigation of all-author co-citation analysis, including a definition of co-citation counts reminiscent of the definitions given earlier by Rousseau and Zuccala (2004). The study defines three different counting methods: first-author co-citation; inclusive all-author co-citation; and exclusive all-author co-citation.
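The first-author counting rule described above (McCain, 1990) can be sketched in a few lines. This is a minimal illustration, assuming invented toy data and a hypothetical function name, first_author_cocitations. Each citing document contributes at most one count per author pair, however many works of each author it cites.

```python
from collections import Counter
from itertools import combinations

# Toy data (invented): each citing document is a list of cited references,
# and each reference is a tuple of author names in byline order.
citing_docs = [
    [("White", "Griffith"), ("McCain",)],
    [("White",), ("McCain", "Zhao"), ("Persson",)],
]


def first_author_cocitations(docs):
    """Count, per author pair, the citing documents that cite both as first authors."""
    counts = Counter()
    for refs in docs:
        # An oeuvre is restricted to works with the author in first position.
        cited = {ref[0] for ref in refs if ref}
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1  # one count per citing document
    return counts


first_counts = first_author_cocitations(citing_docs)
```

With this data, White and McCain are co-cited twice, while Griffith and Zhao never enter the counts because they appear only in non-first positions.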
Likewise, as a consequence of all-author co-citation analysis, the study redefines “...an author’s oeuvre as all works with this author as one of the authors of each of the works.” (Zhao, 2006, p. 1580). The distinction between inclusive and exclusive all-author co-citations follows directly from this definition of an author’s oeuvre, as two authors may also be considered co-cited when a paper that the two authors co-authored is cited. Thus, co-authorships, when cited, can also be counted as co-citations. This means that inclusive all-author co-citation analysis counts cited co-authorships, whereas exclusive all-author co-citation analysis does not. Typically, author co-citations and co-authorships are treated as different units of analysis, where the former is used to map intellectual structures and the latter to investigate research collaboration. Rousseau and Zuccala (2004), in their definition, suggest that such an approach supports the view that authors, regardless of their overall authorship ranking, can contribute substantially to the development of a research area, and that it presents a more accurate portrayal of an individual author’s contribution to a research area where high rates of co-authorship are prevalent. Besides the novel definition of all-author co-citation counting, Zhao (2006) adheres to a traditional Drexel-approach to ACA (see below). The dataset was rather small: it consisted of 312 publications in PDF on the subject of XML identified using CiteSeer. The 312 publications contained 4578 citations, which were used as a basis for the co-citation analysis. The results of the study indicate that all-author co-citation counting creates more coherent groups of authors, which supposedly should be considerably clearer to identify and interpret.
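The difference between inclusive and exclusive all-author counting can be made concrete with a small sketch. It rests on one reading of the definitions above, stated here as an assumption: under exclusive counting a pair is counted only if the citing document contains two distinct references, one for each author, so a cited co-authored paper alone does not count. The data and the function name all_author_cocitations are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Toy data (invented): reference lists of two citing documents.
citing_docs = [
    [("White", "Griffith"), ("McCain",)],
    [("White", "Griffith")],
]


def all_author_cocitations(docs, inclusive=True):
    """Count co-cited author pairs with every author position included."""
    counts = Counter()
    for refs in docs:
        # Map each cited author to the indexes of the references naming them.
        where = {}
        for i, ref in enumerate(refs):
            for name in ref:
                where.setdefault(name, set()).add(i)
        for a, b in combinations(sorted(where), 2):
            # Assumed exclusive rule: the pair needs two distinct references.
            # That fails only when both authors occur in one and the same
            # single reference, i.e. a cited co-authorship and nothing else.
            if inclusive or len(where[a] | where[b]) > 1:
                counts[(a, b)] += 1
    return counts


inclusive_counts = all_author_cocitations(citing_docs, inclusive=True)
exclusive_counts = all_author_cocitations(citing_docs, inclusive=False)
```

Here Griffith and White receive two inclusive co-citations, both from their co-authored paper, and none under the exclusive rule, while their pairings with McCain are counted once under both rules.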
Nevertheless, due to the straightforward application of citation thresholds for including cited authors in the study, the results also show that all-author co-citation counting can lead to the identification of fewer specialties in a research field compared to first-author co-citation counting, that is, when the same number of top-ranked authors is selected and analyzed (Zhao, 2006). Zhao (2006) undoubtedly contributes considerably to our understanding of all-author co-citation analysis. However, for the time being, the results of the empirical study must be treated with caution until we have more substantial evidence that may or may not support its findings. The motivation for the present paper is therefore to continue the work of Zhao (2006) by further investigating inclusive all-author co-citation analysis in order to bring about deeper empirical understanding and evidence concerning this novel counting approach. The present study is the first in a series that addresses the research possibilities inherent in a citation index based on source documents formatted in XML. One such possibility is all-author co-citation analysis, and the present study is based on the hitherto largest set of citing documents applied in an all-author co-citation analysis.

1 http://citeseer.ist.psu.edu/

Co-citation Matrix Generation

Most recently the role played by matrices in co-citation analyses has received attention. Leydesdorff and Vaughan (2006) demonstrate the fundamental difference between asymmetric data matrices (n × m) and symmetric proximity matrices (n × n), arguing that symmetric matrices of co-occurrence counts are per se proximity matrices and should be treated as such. In the Drexel-approach to ACA, first author co-citation counts are obtained by online retrieval. Subsequently the co-citation counts are entered into a symmetric proximity matrix.
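The relationship between the two matrix types can be illustrated with a toy example. Assuming a binary occurrence matrix of cited authors by citing documents (n × m), multiplying it by its transpose gives a symmetric n × n matrix whose off-diagonal cells are co-citation counts and whose diagonal holds each author's citation count. The data and the helper name cooccurrence are invented for illustration; this sketches the general relationship, not the procedure of any of the cited studies.

```python
# Rows are cited authors (n = 3), columns are citing documents (m = 4);
# a 1 means the author is cited in that document. Data are invented.
authors = ["White", "McCain", "Persson"]
occurrence = [
    [1, 1, 0, 1],  # White
    [1, 1, 1, 0],  # McCain
    [0, 1, 1, 0],  # Persson
]


def cooccurrence(matrix):
    """Return the n x n product of a matrix with its own transpose."""
    n = len(matrix)
    return [
        [sum(x * y for x, y in zip(matrix[i], matrix[j])) for j in range(n)]
        for i in range(n)
    ]


cocitation = cooccurrence(occurrence)
# cocitation[i][j] with i != j is the number of documents citing both authors;
# cocitation[i][i] is the number of documents citing author i.
```

The paired online counting described above yields only the n × n result, so the underlying n × m data matrix is never observed in that workflow.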
However, the desire to apply factor analysis to ACA as a more detailed exploratory tool, in order to identify latent structures and thus help interpret the mapping results, necessitates a symmetric proximity matrix of covariance or correlation coefficients. In traditional multivariate analyses such proximity matrices are derived from an asymmetric data matrix of variables by cases. However, such a matrix is not available in the Drexel-approach due to the paired online counting. As a result, an unorthodox procedure is devised, where the proximity matrix of co-citation counts is transformed into an additional proximity matrix of derived correlation coefficients of first author co-citation profiles. Note that a linear transformation of a symmetric proximity matrix is not straightforward. A theoretical problem arises, as all relations in a symmetric matrix occur twice, which evidently leads to a magnification. Further, the transformation also causes a fundamental problem in relation to the interpret

[1]  P. Schönemann,et al.  Fitting one matrix to another under choice of a central dilation and a rigid motion , 1970 .

[2]  John C. Gower,et al.  Statistical methods of comparing different multivariate analyses of the same data , 1971 .

[3]  Pia Borlund,et al.  Matrix comparison, Part 1: Motivation and important issues for measuring the resemblance between proximity measures or ordination results , 2007, J. Assoc. Inf. Sci. Technol..

[4]  Loet Leydesdorff,et al.  Co-occurrence matrices and their applications in information science: Extending ACA to the Web environment , 2006 .

[5]  Dangzhi Zhao,et al.  Towards all-author co-citation analysis , 2006, Inf. Process. Manag..

[6]  Loet Leydesdorff,et al.  Co-occurrence matrices and their applications in information science: Extending ACA to the Web environment , 2006, J. Assoc. Inf. Sci. Technol..

[7]  P. Green,et al.  Analyzing multivariate data , 1978 .

[8]  Howard D. White,et al.  Author cocitation: A literature measure of intellectual structure , 1981, J. Am. Soc. Inf. Sci..

[9]  Olle Persson All author citations versus first author citations , 2001, Scientometrics.

[10]  Pia Borlund,et al.  Matrix comparison, Part 2: Measuring the resemblance between proximity measures or ordination results by use of the mantel and procrustes statistics , 2007, J. Assoc. Inf. Sci. Technol..

[11]  M. Chan,et al.  Risk factors for citation errors in peer-reviewed nursing journals. , 2001, Journal of advanced nursing.

[12]  Katherine W. McCain,et al.  Mapping authors in intellectual space: A technical overview , 1990, J. Am. Soc. Inf. Sci..

[13]  N. Mantel The detection of disease clustering and a generalized regression approach. , 1967, Cancer research.

[14]  Jesper W. Schneider,et al.  Matrix comparison, Part 1: Motivation and important issues for measuring the resemblance between proximity measures or ordination results , 2007 .

[15]  Chaomei Chen,et al.  Visualising Semantic Spaces and Author Co-Citation Networks in Digital Libraries , 1999, Inf. Process. Manag..

[16]  Ronald Rousseau,et al.  Requirements for a cocitation similarity measure, with special reference to Pearson's correlation coefficient , 2003, J. Assoc. Inf. Sci. Technol..

[17]  Ronald Rousseau,et al.  Author cocitation analysis and Pearson's r , 2004, J. Assoc. Inf. Sci. Technol..

[18]  Wolfgang Glänzel,et al.  The need for standards in bibliometric research and technology , 2005, Scientometrics.

[19]  Ronald Rousseau,et al.  A classification of author co-citations: Definitions and search strategies , 2004, J. Assoc. Inf. Sci. Technol..

[20]  Gabriella Kazai,et al.  Overview of INEX 2005 , 2005, INEX.

[21]  C. Lee Giles,et al.  CiteSeer: an automatic citation indexing system , 1998, DL '98.