论文信息 - Data Curation at Scale: The Data Tamer System

Data Curation at Scale: The Data Tamer System

Data curation is the act of discovering a data source(s) of interest, cleaning and transforming the new data, semantically integrating it with other local data sources, and deduplicating the resulting composite. There has been much research on the various components of curation (especially data integration and deduplication). However, there has been little work on collecting all of the curation components into an integrated end-to-end system. In addition, most of the previous work will not scale to the sizes of problems that we are nding in the eld. For example, one web aggregator requires the curation of 80,000 URLs and a second biotech company has the problem of curating 8000 spreadsheets. At this scale, data curation cannot be a manual (human) eort, but must entail machine learning approaches with a human assist only when necessary. This paper describes Data Tamer, an end-to-end curation system we have built at M.I.T. Brandeis, and Qatar Computing Research Institute (QCRI). It expects as input a sequence of data sources to add to a composite being constructed over time. A new source is subjected to machine learning algorithms to perform attribute identication, grouping of attributes into tables, transformation of incoming data and deduplication. When necessary, a human can be asked for guidance. Also, Data Tamer includes a data visualization component so a human can examine a data source at will and specify manual transformations. We have run Data Tamer on three real world enterprise curation problems, and it has been shown to lower curation cost by about 90%, relative to the currently deployed production software.

[1] Daisy Zhe Wang,et al. WebTables: exploring the power of tables on the web , 2008, Proc. VLDB Endow..

[2] Andrew McCallum,et al. Efficient clustering of high-dimensional data sets with application to reference matching , 2000, KDD '00.

[3] Peter Christen,et al. Febrl - Freely extensible biomedical record linkage , 2002 .

[4] Erhard Rahm,et al. A survey of approaches to automatic schema matching , 2001, The VLDB Journal.

[5] W. Winkler. Overview of Record Linkage and Current Research Directions , 2006 .

[6] Claire Mathieu,et al. Online Correlation Clustering , 2010, STACS.

[7] Sergei Vassilvitskii,et al. k-means++: the advantages of careful seeding , 2007, SODA '07.

[8] Phokion G. Kolaitis,et al. Semi-Automatic Schema Integration in Clio , 2007, VLDB.

[9] Ahmed K. Elmagarmid,et al. Duplicate Record Detection: A Survey , 2007, IEEE Transactions on Knowledge and Data Engineering.

[10] Joseph M. Hellerstein,et al. Potter's Wheel: An Interactive Data Cleaning System , 2001, VLDB.

[11] Peter Christen,et al. A Comparison of Fast Blocking Methods for Record Linkage , 2003, KDD 2003.

[12] Rajeev Motwani,et al. Robust identification of fuzzy duplicates , 2005, 21st International Conference on Data Engineering (ICDE'05).