A Comparison of On-Line Computer Science Citation Databases

This paper examines the differences and similarities between two on-line computer science citation databases, DBLP and CiteSeer. The database entries in DBLP are inserted manually, while the CiteSeer entries are obtained autonomously via a crawl of the Web and automatic processing of user submissions. CiteSeer's autonomous citation database can be considered a form of self-selected on-line survey. It is important to understand the limitations of such databases, particularly when citation information is used to assess the performance of authors, institutions and funding bodies. We show that the CiteSeer database contains considerably fewer single-author papers. This bias can be modeled by an exponential process with an intuitive explanation. The model permits us to predict that the DBLP database covers approximately 24% of the entire literature of Computer Science. CiteSeer is also biased against low-cited papers. Despite their differences, the two databases exhibit citation distributions that are similar to each other but significantly different from those reported in previous analyses of the Physics community. In both databases, we also observe that the number of authors per paper has been increasing over time.
