Learning metadata from the evidence in an on-line citation matching scheme

Citation matching, the automatic grouping of bibliographic references that refer to the same document, is a data management problem faced by automatic digital libraries for scientific literature such as CiteSeer and Google Scholar. Although several solutions have been offered for citation matching in large bibliographic databases, they typically require expensive batch clustering operations that must be run offline. Large digital libraries containing citation information can reduce maintenance costs and provide new services by processing citation data efficiently online, resolving document citation relationships as new records become available. Information found in citations can also be used to supplement document metadata. Doing so requires generating a canonical citation record, in which variant citation subfields are merged into a unified "best guess" from which to draw information, and then merging that citation information with other information sources to provide a complete document record. This paper outlines a system and algorithms for online citation matching and canonical metadata generation. A Bayesian framework is employed to build the ideal citation record for a document; the framework carries the added advantages of fusing information from disparate sources and increasing system resilience to erroneous data.
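The idea of merging variant subfields into a "best guess" can be illustrated with a minimal sketch. It is not the paper's algorithm: the function name `canonical_value`, the per-source reliability table, the uniform prior over observed values, and the assumption that an erroneous source picks uniformly among the other observed values are all hypothetical choices made here for illustration.

```python
import math


def canonical_value(observations, reliability, default_reliability=0.7):
    """Pick a "best guess" for one citation field (e.g. year or title).

    observations: list of (value, source) pairs seen for this field.
    reliability:  dict mapping source name to the assumed probability,
                  strictly between 0 and 1, that the source reports the
                  true value (hypothetical numbers, not from the paper).
    Returns (best_value, posterior) where posterior maps each observed
    value to its normalized posterior probability.
    """
    candidates = sorted({value for value, _ in observations})
    if not candidates:
        raise ValueError("no observations for this field")
    if len(candidates) == 1:
        # Every source agrees, so there is nothing to arbitrate.
        return candidates[0], {candidates[0]: 1.0}

    k = len(candidates)
    log_post = {}
    for cand in candidates:
        # Uniform prior over candidates: a constant, so it is omitted.
        lp = 0.0
        for value, source in observations:
            p = reliability.get(source, default_reliability)
            # Assumed error model: a wrong report is spread evenly
            # over the k - 1 other observed values.
            lp += math.log(p if value == cand else (1.0 - p) / (k - 1))
        log_post[cand] = lp

    # Normalize in log space to avoid underflow with many observations.
    peak = max(log_post.values())
    weights = {c: math.exp(lp - peak) for c, lp in log_post.items()}
    total = sum(weights.values())
    posterior = {c: w / total for c, w in weights.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior
```

Under these assumptions, a single observation from a source judged highly reliable (say, a document header) can outweigh two agreeing observations from less reliable citation strings, while a larger number of agreeing citations can still override it. This is one simple way that fusing disparate sources can make the merged record more resilient to erroneous data.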
