RefConcile - Automated Online Reconciliation of Bibliographic References

Comprehensive bibliographies often rely on community contributions. In such settings, de-duplication is mandatory for the bibliography to be useful. Ideally, de-duplication works online, i.e., when adding new references, so the bibliography remains duplicate-free at all times. While de-duplication is well researched, generic approaches do not achieve the result quality required for automated reconciliation. To overcome this problem, we propose a new duplicate detection and reconciliation technique called RefConcile. Designed specifically for bibliographic references, it uses blocking and matching techniques tailored to this type of data. Our evaluation on a large real-world collection of bibliographic references shows that RefConcile scales well and that it detects and reconciles duplicates with high accuracy.
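
To make the idea of online de-duplication with blocking and matching concrete, the following is a minimal Python sketch. It is not RefConcile's actual algorithm, which the abstract does not specify. The blocking key (publication year plus first title word), the title similarity measure (difflib's SequenceMatcher ratio), the 0.9 threshold, and all names such as OnlineDeduplicator are assumptions chosen purely for illustration.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse every non-alphanumeric run to a single space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def blocking_key(ref):
    """Hypothetical blocking key: publication year plus first title word."""
    words = normalize(ref["title"]).split()
    return (ref["year"], words[0] if words else "")


class OnlineDeduplicator:
    """Illustrative only: checks each new reference against its block on insertion."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold       # assumed similarity cut-off
        self.blocks = defaultdict(list)  # blocking key -> stored references

    def add(self, ref):
        """Return the stored duplicate of ref, or store ref and return None."""
        title = normalize(ref["title"])
        block = self.blocks[blocking_key(ref)]
        # Matching: compare only against candidates that share the blocking key.
        for stored in block:
            score = SequenceMatcher(None, title, normalize(stored["title"])).ratio()
            if score >= self.threshold:
                return stored
        block.append(ref)
        return None


dedup = OnlineDeduplicator()
first = {"title": "Clustering Data Streams", "year": 2000}
dedup.add(first)                                                 # new, gets stored
dedup.add({"title": "Clustering data streams.", "year": 2000})  # matches `first`
```

Blocking keeps each insertion cheap because a new reference is compared only with the stored references that share its key. The cost is that a duplicate whose key differs, for example because of a wrong year, is never compared at all. This trade-off is one reason a blocking scheme tailored to the data matters.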
