On the problems of dealing with bibliometric data

Dear Sir, An important component of scientific work is to generate data in experiments and computer simulations and to interpret them. However, these data are rarely self-explanatory and often can only be understood within the narrow specialist context in which they arose. Knowledge of their accuracy or faultiness and of the possibility of bias is key to meaningful interpretation. Scientific data are not made up of independent units that speak for themselves: They must always be rooted in the history of their creation and looked at in the context of the subject matter. The best way to assess data realistically is to generate the data oneself and to trace their development from the raw data through any selection and concentration to the final depiction and interpretation. This is the only way to develop an appreciation of the meaning of data and their value to science. This appreciation is also the most important basis for detecting possible anomalies (which not infrequently have been and are the starting point for the discovery of fundamental phenomena) and for distinguishing them from statistical fluctuations and errors. Quantitative (bibliometric) methods for measuring the productivity and impact of research performance are particularly at risk from the incorrect interpretation of data. This is because research into the underlying data is usually separate from their interpretation and application for the purposes of evaluating research. There are essentially three groups involved in this process: (a) the database producers (primarily Thomson Reuters, producer of the Web of Science, and Elsevier, producer of Scopus), (b) the bibliometricians, and (c) the end-users. The producers create the databases and offer appropriate search systems. The bibliometricians are the experts who analyze the publication and citation data.
Known as bibliometrics (or scientometrics), this discipline has grown dynamically and rapidly and has its own journals, conferences, and academic departments. An end-user is anyone interested in the research performance of specific entities (individual scientists, research groups, or universities). For example, researchers are interested in the impact of their own work and that of their colleagues and increasingly find themselves required to supply bibliometric data throughout their academic career, for instance when applying for a new position. Decision-makers in the management or administration of universities may use these data for research applications, for appointments, when considering the award of prizes for research, when giving an account of their institutions in evaluations, or for press and publicity purposes. Membership of more than one group is possible. For example, some end-users have acquired bibliometric skills (or think they have) and carry out searches for their own purposes. On the other hand, database producers and some research institutions that have specialized in bibliometrics, such as the Centre for Science & Technology Studies (CWTS) in Leiden, offer pre-processed data (e.g., for the purposes of evaluation). Thomson Reuters provides tools for analyses (InCites) that are carried out by bibliometricians, and the specialist institutions generate bibliometric data sets that are made available by database producers. When end-users themselves access the Web of Science or Scopus for bibliometric analyses, their research and interpretations are often based on insufficient knowledge. They may lack the experience needed to distinguish accurately between people (problem of namesakes) and between institutions (problem of address variants). Further problems arise when only the best-known indicators (and frequently only one of them) are used, such as the h-index or the Journal Impact Factor (JIF).
As a rule, these end-users perform analyses without any normalization for subject area and publication year and nevertheless use them to make statements across subject areas and periods of time. This raises issues of fairness and ethics of which the end-users are not aware. They are frequently of the opinion that counting publications and their citations is a quite simple matter and that the resulting data are self-explanatory. “Advanced bibliometric methods have now come to a stage of providing excitement instead of ‘just easy data’. . . . An important, absolutely necessary condition is that applied citation analysis is part of an advanced, technically highly developed bibliometric method.” (van Raan, 2000, pp. 301, 306). Bibliometric data used in evaluations are politically critical and associated with strong interests (in particular reputation and money). “Bibliometric indicators have become such a powerful tool within the context of science policy that consideration must be given to their potential for misleading and destructive use. Their potency requires a code of professional ethics to govern their application” (Weingart, 2005, p. 120). Primarily this means applying the best and fairest approach available in the current bibliometric community (i.e., the most appropriate indicators and not the simplest and cheapest) and also pointing out the limitations of the method and potential distortions (Marx & Bornmann, 2013). Scientists, who should be used to handling bibliometric data as end-users, should be able to understand the limitations
© 2013 ASIS&T