Big data integration

The Big Data era is upon us: data is being generated, collected and analyzed at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. Since the value of data explodes when it can be linked and fused with other data, addressing the big data integration (BDI) challenge is critical to realizing the promise of Big Data. BDI differs from traditional data integration along several dimensions: (i) the number of data sources, even for a single domain, has grown into the tens of thousands, (ii) many of the data sources are very dynamic, as huge amounts of newly collected data are continuously made available, (iii) the data sources are extremely heterogeneous in their structure, with considerable variety even for substantially similar entities, and (iv) the data sources differ widely in quality, with significant differences in the coverage, accuracy and timeliness of the data provided. This seminar explores the progress that the data integration community has made on schema mapping, record linkage and data fusion in addressing these novel challenges of big data integration, and identifies a range of open problems for the community.
