Interactive Entity Resolution in Relational Data: A Visual Analytic Tool and Its Evaluation

Databases often contain uncertain and imprecise references to real-world entities. Entity resolution, the process of reconciling multiple references to the same underlying real-world entity, is an important data cleaning step that must precede accurate visualization or analysis of the data. In many cases, in addition to noisy data describing entities, there is data describing the relationships among the entities. This relational data is important during entity resolution: it is useful both for the algorithms that identify database references likely to need resolving and for the visual analytic tools that support the process. In this paper, we introduce a novel user interface, D-Dupe, for interactive entity resolution in relational data. D-Dupe combines relational entity resolution algorithms with a novel network visualization that lets users draw on an entity's relational context when making resolution decisions. Because resolution decisions are often interdependent, D-Dupe helps users understand this complex process through animations that highlight combined inferences and a history mechanism that allows users to inspect chains of resolution decisions. An empirical study with 12 users confirmed that the relational context visualization benefits the performance of entity resolution tasks in relational data, in terms of time as well as users' confidence and satisfaction.
