Finding Quality in Quantity: The Challenge of Discovering Valuable Sources for Integration

Data is becoming a commodity of tremendous value for many domains. This is leading to a rapid increase in the number of data sources and public access data services, such as cloud-based data markets and data portals, that facilitate the collection, publishing and trading of data. Data sources typically exhibit wide variety and heterogeneity in the types or schemas of the data they provide, their quality, and the fees they charge for accessing their data. Users who want to build upon such publicly available data, must (i) discover sources that are relevant to their applications, (ii) identify sources that collectively satisfy the quality and budget requirements of their applications, with few effective clues about the quality of the sources, and (iii) repeatedly invest many person-hours in assessing the eventual usefulness of data sources. All three steps require investigating the content of the sources manually, integrating them and evaluating the actual benefit of the integration result for a desired application. Unfortunately, when the number of data sources is large, humans have a limited capability of reasoning about the actual quality of sources and the trade-offs between the benefits and costs of acquiring and integrating sources. In this paper we explore the problems of automatically appraising the quality of data sources and identifying the most valuable sources for diverse applications. We introduce our vision for a new data source management system that automatically assesses the quality of data sources based on a collection of rigorous data quality metrics and enables the automated and interactive discovery of valuable sources for user applications. We argue that the proposed system can dramatically simplify the Discover-Appraise-Evaluate interaction loop that many users follow today to discover sources for their applications.

[1]  Sunita Sarawagi,et al.  Answering Table Queries on the Web using Column Keywords , 2012, Proc. VLDB Endow..

[2]  William W. Cohen,et al.  Regularization of Latent Variable Models to Obtain Sparsity , 2013, SDM.

[3]  Alon Y. Halevy,et al.  Data integration with uncertainty , 2007, The VLDB Journal.

[4]  Surajit Chaudhuri,et al.  InfoGather: entity augmentation and attribute discovery by holistic matching with web tables , 2012, SIGMOD Conference.

[5]  Dan Suciu,et al.  Data Markets in the Cloud: An Opportunity for the Database Community , 2011, Proc. VLDB Endow..

[6]  Richard Y. Wang,et al.  Data Quality Assessment , 2002 .

[7]  Donald Kossmann,et al.  Shooting Stars in the Sky: An Online Algorithm for Skyline Queries , 2002, VLDB.

[8]  Jarek Gryz,et al.  Algorithms and analyses for maximal vector computation , 2007, The VLDB Journal.

[9]  Naren Ramakrishnan,et al.  Detecting and forecasting domestic political crises: a graph-based approach , 2014, WebSci '14.

[10]  Nir Friedman,et al.  Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning , 2009 .

[11]  Divesh Srivastava,et al.  Integrating Conflicting Data: The Role of Source Dependence , 2009, Proc. VLDB Endow..

[12]  Divesh Srivastava,et al.  Characterizing and selecting fresh data sources , 2014, SIGMOD Conference.

[13]  Ben Taskar,et al.  Probabilistic Relational Models , 2014, Encyclopedia of Social Network Analysis and Mining.

[14]  Jayant Madhavan,et al.  Recovering Semantics of Tables on the Web , 2011, Proc. VLDB Endow..

[15]  Alon Y. Halevy,et al.  Data Integration for the Relational Web , 2009, Proc. VLDB Endow..

[16]  Divesh Srivastava,et al.  Sailing the Information Ocean with Awareness of Currents: Discovery and Application of Source Dependence , 2009, CIDR.

[17]  Reynold Xin,et al.  Finding related tables , 2012, SIGMOD Conference.

[18]  Valter Crescenzi,et al.  Extraction and Integration of Partially Overlapping Web Sources , 2013, Proc. VLDB Endow..

[19]  Divesh Srivastava,et al.  Fusing data with correlations , 2014, SIGMOD Conference.

[20]  Christian S. Jensen,et al.  Google fusion tables: data management, integration and collaboration in the cloud , 2010, SoCC '10.

[21]  T. Simpson,et al.  Efficient Pareto Frontier Exploration using Surrogate Approximations , 2000 .

[22]  Divesh Srivastava,et al.  Truth Finding on the Deep Web: Is the Problem Solved? , 2012, Proc. VLDB Endow..

[23]  Divesh Srivastava,et al.  Less is More: Selecting Sources Wisely for Integration , 2012, Proc. VLDB Endow..

[24]  Sebastian Schutte,et al.  Matched Wake Analysis: Finding Causal Relationships in Spatiotemporal Event Data , 2014 .

[25]  Nir Friedman,et al.  Learning Hidden Variable Networks: The Information Bottleneck Approach , 2005, J. Mach. Learn. Res..

[26]  Hakan Hacigümüs,et al.  Affordable Analytics on Expensive Data , 2014, Data4U '14.

[27]  Renée J. Miller,et al.  Discovering Linkage Points over Web Data , 2013, Proc. VLDB Endow..

[28]  Beng Chin Ooi,et al.  A hybrid machine-crowdsourcing system for matching web tables , 2014, 2014 IEEE 30th International Conference on Data Engineering.

[29]  Bo Pang,et al.  Thumbs up? Sentiment Classification using Machine Learning Techniques , 2002, EMNLP.

[30]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[31]  Sunita Sarawagi,et al.  Annotating and searching web tables using entities, types and relationships , 2010, Proc. VLDB Endow..