Improving citation mining

In recent years, the number of citations a paper receives has increasingly (perhaps excessively) been treated as an important indicator of the quality of the paper, of its authors, of the journal in which it appears, and so on. Various measures have been introduced based on the number of citations a scholar has received over a lifetime or over the last few years. The raw citation count (often excluding self-citations or citations from “minor” sources, however these are defined) or some measure derived from it (such as the h-index or the g-index) is used to evaluate scholars; the citation index of a journal (again with a variety of parameters) is taken to measure the journal's impact, and hence the importance assigned to publications in it. The number of citation-based measures is steadily increasing, and defining them has become a science in itself. However, they all rest on finding all relevant citations. Thus the citation mining tools behind the ISI Web of Knowledge, the CiteSeer citation index, Google Scholar, and software built on Google Scholar such as the “publishorperish.com” software are the critical starting point for all measurement efforts. In this paper we show that current citation mining techniques do not discover all relevant citations. We propose a technique that substantially increases accuracy and present a numerical evaluation for one typical journal. In the absence of highly reliable citation mining tools, all current measures based on citation counting should be taken with a grain of salt.
