Paper Information - Schema-agnostic vs Schema-based Configurations for Blocking Methods on Homogeneous Data

Entity Resolution is a core data integration task that, because of its quadratic complexity, typically scales to large datasets through blocking methods. These methods can be configured in two ways. The schema-based configuration relies on schema information to select signatures of high distinctiveness and low noise, while the schema-agnostic configuration treats every token from all attribute values as a signature. The latter approach has significant potential, as it requires no fine-tuning by human experts and applies to heterogeneous data. Yet there is no systematic study of its performance relative to the schema-based configuration. This work fills that gap by analytically comparing the two configurations in terms of effectiveness, time efficiency, and scalability. We apply them to 9 established blocking methods and to 11 benchmarks of structured data, and we use a novel taxonomy to provide insights into the internal functionality of the blocking methods. Our studies reveal that the schema-agnostic configuration offers an unsupervised and robust definition of blocking keys under versatile settings, trading a higher computational cost for consistently higher recall than the schema-based one. It also enables the use of state-of-the-art blocking methods without schema knowledge.
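To make the distinction between the two configurations concrete, the following is a minimal sketch rather than the paper's implementation. It assumes the simplest instance of each configuration: token blocking for the schema-agnostic case, and exact-value blocking on one hand-picked attribute for the schema-based case. The function names (`schema_agnostic_blocks`, `schema_based_blocks`, `candidate_pairs`) and the four toy records are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def schema_agnostic_blocks(records):
    """Token blocking (assumed variant): every token of every attribute
    value becomes a blocking key, regardless of the attribute name."""
    blocks = defaultdict(set)
    for rid, record in records.items():
        for value in record.values():
            for token in str(value).lower().split():
                blocks[token].add(rid)
    return blocks


def schema_based_blocks(records, attribute):
    """Schema-based blocking (assumed variant): the full, normalized value
    of one attribute chosen by a human expert is the blocking key."""
    blocks = defaultdict(set)
    for rid, record in records.items():
        value = record.get(attribute)
        if value:
            blocks[str(value).lower().strip()].add(rid)
    return blocks


def candidate_pairs(blocks):
    """Collect the distinct record pairs that co-occur in at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical toy records; records 1 and 2 differ by a typo in the name.
records = {
    1: {"name": "John Smith", "city": "Boston"},
    2: {"name": "Jon Smith", "city": "Boston"},
    3: {"name": "Mary Jones", "city": "Austin"},
    4: {"name": "John Smith", "city": "Austin"},
}

agnostic = candidate_pairs(schema_agnostic_blocks(records))
based = candidate_pairs(schema_based_blocks(records, "name"))

print(sorted(based))     # pairs proposed by the schema-based configuration
print(sorted(agnostic))  # pairs proposed by the schema-agnostic configuration
```

On this toy input the schema-based configuration proposes a single candidate pair, while the schema-agnostic one proposes four, including the typo variant (records 1 and 2) that exact matching on `name` misses. This mirrors, on a very small scale, the trade-off the abstract describes: more comparisons in exchange for higher recall.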
