Effective and scalable solutions for mixed and split citation problems in digital libraries

In this paper, we consider two problems that commonly occur in bibliographic digital libraries and seriously degrade their data quality: the Mixed Citation (MC) problem, in which citations of different scholars who share the same name (homonyms) are mixed together, and the Split Citation (SC) problem, in which citations of the same author appear under different name variants. Because the citation collections in such digital libraries tend to be large-scale, we seek a solution that is both effective and scalable. After formally defining the problems and the accompanying challenges, we present a solution based on a state-of-the-art sampling-based approximate join algorithm. Preliminary experimental results support our claim.
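
To make the Split Citation problem concrete, the following minimal sketch flags author-name variants that may refer to the same person. It is only an illustration under stated assumptions: it uses a naive all-pairs join with Jaccard similarity over character bigrams, not the sampling-based approximate join algorithm the paper builds on, and the function names, the threshold of 0.5, and the author names are all hypothetical.

```python
from itertools import combinations


def bigrams(name):
    """Return the set of character bigrams of a normalized author name."""
    # Normalize: lowercase, drop punctuation, collapse runs of whitespace.
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch == " ")
    text = " ".join(kept.split())
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def candidate_split_pairs(names, threshold=0.5):
    """Naive all-pairs join: return (name_a, name_b, similarity) for pairs
    whose bigram similarity reaches the (assumed) threshold.

    This compares every pair, so its cost is quadratic in the number of
    names; it is a stand-in for illustration, not a scalable algorithm.
    """
    unique = sorted(set(names))
    grams = {n: bigrams(n) for n in unique}
    pairs = []
    for x, y in combinations(unique, 2):
        sim = jaccard(grams[x], grams[y])
        if sim >= threshold:
            pairs.append((x, y, sim))
    return pairs


if __name__ == "__main__":
    # Hypothetical author names, used purely for illustration.
    authors = ["Jordan K. Rivers", "Jordan Rivers", "Wei Wang"]
    for a, b, sim in candidate_split_pairs(authors):
        print(f"{a!r} ~ {b!r} (similarity {sim:.2f})")
```

The quadratic pairwise comparison in this sketch is exactly what becomes impractical on large citation collections, which is why scalability is a central concern for the problem.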
