Definition and Formalization of Entity Resolution Functions for Everyday Information Integration

Data integration on a human-manageable scale, performed by users without database expertise, is a more common activity than the integration of large databases. Users often gather fine-grained data and organize it in an entity-centric way, building tables of information about real-world objects, ideas, or people. Often, they do this by copying and pasting pieces of data from e-mails, databases, or text files into a spreadsheet. During this process, users' notions of entities and attributes evolve: they combine sets of entities or attributes, split them again, update attribute values, and retract those updates. These functions are neither well supported by current tools nor well understood formally. Our research seeks to capture and make explicit the data integration decisions made during these activities. In this paper, we formally define entity resolution and de-resolution, and show that these functions behave predictably and intuitively in the presence of attribute value updates.
