A generic Web-based entity resolution framework

Web data repositories usually contain references to thousands of real-world entities from multiple sources. It is not uncommon for multiple entities to share the same label (polysemes) or for distinct label variations to be associated with the same entity (synonyms), which frequently leads to ambiguous interpretations. Spelling variants, acronyms, abbreviated forms, and misspellings compound the problem. Solving it requires identifying which labels correspond to the same real-world entity, a process known as entity resolution. One approach to the entity resolution problem is to associate an authority identifier and a list of variant forms with each entity, a data structure known as an authority file. In this work, we propose a generic framework for implementing a method for generating authority files. Because our method uses information from the Web to improve the quality of the authority file, we refer to it as WER (Web-based Entity Resolution). Our contribution is threefold: (a) we discuss how to implement the WER framework, which is flexible and easy to adapt to new domains; (b) we run extensive experiments with the WER framework to show that it outperforms selected baselines; and (c) we compare the results of a specialized solution for author name resolution with those produced by the generic WER framework, and show that the WER results remain competitive. © 2011 Wiley Periodicals, Inc.
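To make the notion of an authority file concrete, the following minimal Python sketch stores one authority identifier per entity together with its list of variant forms, and looks up which entities a label may refer to. It illustrates the data structure only and is not the WER method; the class names, the whitespace and case normalization, and the sample identifiers and labels are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class AuthorityRecord:
    """One entity: an authority identifier plus its known variant forms."""
    authority_id: str
    variants: Set[str] = field(default_factory=set)


class AuthorityFile:
    """Hypothetical in-memory authority file (illustrative, not the WER framework)."""

    def __init__(self) -> None:
        self._records: Dict[str, AuthorityRecord] = {}
        # Reverse index: normalized label -> identifiers of entities using it.
        self._index: Dict[str, Set[str]] = {}

    @staticmethod
    def _normalize(label: str) -> str:
        # Assumed normalization: lowercase and collapse runs of whitespace.
        return " ".join(label.lower().split())

    def add_variant(self, authority_id: str, label: str) -> None:
        record = self._records.setdefault(authority_id, AuthorityRecord(authority_id))
        record.variants.add(label)
        self._index.setdefault(self._normalize(label), set()).add(authority_id)

    def variants_of(self, authority_id: str) -> Set[str]:
        # Synonyms: every label recorded for one entity.
        return set(self._records[authority_id].variants)

    def resolve(self, label: str) -> Set[str]:
        # Polysemes: one label may map to more than one entity.
        return set(self._index.get(self._normalize(label), set()))


# Hypothetical sample data: "J. Smith" is shared by two distinct entities.
authority_file = AuthorityFile()
authority_file.add_variant("E1", "John Smith")
authority_file.add_variant("E1", "J. Smith")
authority_file.add_variant("E2", "Jane Smith")
authority_file.add_variant("E2", "J. Smith")
```

In this sketch, `resolve("j.  smith")` returns both `E1` and `E2`, which is the ambiguity that entity resolution has to settle, while `variants_of("E1")` returns the synonyms grouped under one identifier.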
