Effects of Unpopular Citation Fields in Citation Matching Performance

Citation matching is the problem of identifying which citations correspond to the same publication. Previous studies on citation matching typically select, from a corpus or database of citation records such as CORA, an arbitrary set of citation record fields (for example, author and title), a practice informed by "common sense", in order to automatically group citations that refer to the same document. This study describes a systematic, computational approach to extracting the 'best candidate' citation record fields. We propose that there is always a best combination of citation record fields that increases citation matching performance and that applies regardless of the research framework adopted, such as Machine Learning methods or Information Retrieval algorithms. We present cross-comparisons between previous studies and our approach, reported as pairwise F1 measures, within our field-selection framework.
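For readers unfamiliar with the evaluation measure, the following is a minimal sketch of how a pairwise F1 score is commonly computed for a clustering of citations, together with a simple enumeration of candidate field combinations. It is an illustration under stated assumptions, not the procedure used in the study: the function names (`coreferent_pairs`, `pairwise_f1`, `field_subsets`), the dictionary representation of clusterings, and the exhaustive enumeration of subsets are all hypothetical choices made for this example.

```python
from itertools import combinations


def coreferent_pairs(cluster_of):
    """Return the unordered citation-id pairs that share a cluster label.

    `cluster_of` maps a citation id to a cluster label (assumed representation).
    """
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(cluster_of), 2)
        if cluster_of[a] == cluster_of[b]
    }


def pairwise_f1(predicted, gold):
    """Pairwise F1 between a predicted clustering and a gold clustering."""
    pred_pairs = coreferent_pairs(predicted)
    gold_pairs = coreferent_pairs(gold)
    # Convention chosen for this sketch: score 0.0 when either side has no pairs.
    if not pred_pairs or not gold_pairs:
        return 0.0
    true_positives = len(pred_pairs & gold_pairs)
    precision = true_positives / len(pred_pairs)
    recall = true_positives / len(gold_pairs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def field_subsets(fields):
    """Yield every non-empty combination of citation record fields."""
    for size in range(1, len(fields) + 1):
        yield from combinations(fields, size)


# Toy data: four citations, three of which refer to the same paper.
gold = {"c1": "A", "c2": "A", "c3": "A", "c4": "B"}
predicted = {"c1": "X", "c2": "X", "c3": "Y", "c4": "Y"}
score = pairwise_f1(predicted, gold)

# Hypothetical field names; a real corpus would supply its own.
candidates = list(field_subsets(["author", "title", "year"]))
```

In a field-selection setting, each candidate combination would be passed to whatever matcher is under study, and the resulting clustering scored with `pairwise_f1`. Exhaustive enumeration grows exponentially with the number of fields, so it is practical only for small field sets.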
