Soliciting User Feedback in a Dataspace System

A primary challenge in large-scale data integration is creating semantic equivalences between elements from different data sources that correspond to the same real-world entity or concept. Dataspaces propose a pay-as-you-go approach: automated mechanisms such as schema matching and reference reconciliation provide initial correspondences, termed candidate matches, and user feedback is then used to incrementally confirm these matches. The key to this approach is determining the order in which to solicit user feedback for confirming candidate matches. In this paper, we develop a decision-theoretic framework for ordering candidate matches for user confirmation using the concept of the value of perfect information (VPI). At the core of this concept is a utility function that quantifies the desirability of a given state; thus, we devise a utility function for dataspaces based on query result quality. We show in practice how to efficiently apply VPI in concert with this utility function to order user confirmations. A detailed experimental evaluation shows that the ordering of user feedback produced by this VPI-based approach yields a dataspace with a significantly higher utility than a wide range of other ordering strategies. Finally, we outline the design of Roomba, a system that incorporates this decision-theoretic framework to guide a dataspace in soliciting user feedback in a pay-as-you-go manner.
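To make the ordering idea concrete, the following is a minimal sketch of ranking candidate matches by VPI. It is not the utility function or algorithm developed in the paper. It assumes a deliberately simplified setting: each candidate match carries a confidence (treated as the probability that the match is correct) and a weight (a hypothetical stand-in for how much query result quality depends on that match), and the system acts on a match only when its confidence reaches a threshold. All names (`CandidateMatch`, `weight`, `vpi`, `order_for_confirmation`) are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CandidateMatch:
    name: str
    confidence: float  # assumed probability that the match is correct
    weight: float      # hypothetical proxy for impact on query result quality


def expected_utility_now(m: CandidateMatch, threshold: float = 0.5) -> float:
    """Expected utility without feedback (toy model).

    The system uses the match iff confidence >= threshold, so its decision
    is right with probability p when it uses the match and 1 - p otherwise.
    """
    p = m.confidence
    prob_decision_correct = p if p >= threshold else 1.0 - p
    return m.weight * prob_decision_correct


def expected_utility_after_feedback(m: CandidateMatch) -> float:
    """Expected utility once the user confirms or rejects the match.

    With perfect information the decision is always right:
    p * weight + (1 - p) * weight = weight.
    """
    return m.weight


def vpi(m: CandidateMatch, threshold: float = 0.5) -> float:
    """Value of perfect information: expected gain from asking the user."""
    return expected_utility_after_feedback(m) - expected_utility_now(m, threshold)


def order_for_confirmation(candidates: List[CandidateMatch]) -> List[CandidateMatch]:
    """Return candidates sorted so the highest-VPI match is asked first."""
    return sorted(candidates, key=vpi, reverse=True)


candidates = [
    CandidateMatch("author ~ writer", confidence=0.9, weight=10.0),
    CandidateMatch("title ~ name", confidence=0.5, weight=4.0),
    CandidateMatch("venue ~ conference", confidence=0.6, weight=10.0),
]
ordering = [m.name for m in order_for_confirmation(candidates)]
```

In this toy model the VPI of a match reduces to its weight times min(p, 1 - p), so confirmations are requested first for matches that are both uncertain and influential. A high-confidence match ranks low even when its weight is large, because feedback is unlikely to change what the system already does.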