Heterogeneous Committee-Based Active Learning for Entity Resolution (HeALER)

Entity resolution identifies records that refer to the same real-world entity. Its classification step can use supervised learning, but this is limited by the availability of labeled training data. Active learning has been proposed to address this: it gathers labels while reducing human labeling effort by selecting the most informative data as candidates for labeling. Committee-based active learning is one of the most commonly used approaches. It selects the data on which the votes of the committee members disagree most, and treats this data as the most informative. However, current state-of-the-art committee-based active learning approaches for entity resolution have two main drawbacks. First, the selected initial training data is usually not sufficiently balanced or informative. Second, the committee is formed from homogeneous classifiers whose accuracy is compromised to achieve diversity, i.e., the classifiers are not trained with all available training data or with the best parameter setting. In this paper, we propose HeALER, a committee-based active learning approach that overcomes both drawbacks by using more effective initial training data selection and a more effective heterogeneous committee. We implemented HeALER and compared it with passive learning and other state-of-the-art approaches. The experimental results show that our approach outperforms other state-of-the-art committee-based active learning approaches.
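The committee-based selection step described above can be illustrated with a minimal Python sketch. It is not the HeALER implementation: vote entropy is used here as one common disagreement measure (the abstract does not say which measure HeALER uses), and the function names `vote_entropy` and `select_queries`, the three threshold rules standing in for heterogeneous classifiers, and the similarity values in `pool` are all hypothetical.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy of the committee's votes for one candidate pair (0 means full agreement)."""
    total = len(votes)
    counts = Counter(votes)
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


def select_queries(committee, pool, k):
    """Return the indices of the k pool items with the highest vote entropy."""
    scores = []
    for idx, features in enumerate(pool):
        votes = [member(features) for member in committee]
        scores.append((vote_entropy(votes), idx))
    # Highest disagreement first; ties are broken by pool position.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scores[:k]]


# Hypothetical stand-ins for heterogeneous classifiers: each maps a tuple of
# attribute similarities to a match (True) or non-match (False) decision.
committee = [
    lambda f: f[0] > 0.5,               # decides on the first similarity only
    lambda f: (f[0] + f[1]) / 2 > 0.6,  # decides on the mean similarity
    lambda f: min(f) > 0.3,             # requires every similarity to be non-trivial
]

# Hypothetical similarity vectors for four unlabeled record pairs.
pool = [
    (0.95, 0.90),  # clear match: all members agree
    (0.10, 0.05),  # clear non-match: all members agree
    (0.55, 0.40),  # borderline: members disagree
    (0.70, 0.20),  # borderline: members disagree
]

# The two borderline pairs are the ones proposed for human labeling.
queries = select_queries(committee, pool, k=2)
```

In a real setting the committee members would be trained classifiers of different types, retrained after each batch of newly labeled pairs; the selection logic stays the same.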
