Paper Information - Active Learning for Duplicate Record Identification in Deep Web


Active learning is important for duplicate record identification because manually assembling a suitable set of labeled examples is difficult. The task also suffers from imbalanced data: non-match samples far outnumber match samples, which degrades prediction performance on the match class. In this paper, we present a new active learning approach that takes certainty, uncertainty, and representativeness into account. Our method first trains two feature-subspace classifiers and uses the certainty classifier to generate a match pool, from which informative match samples are selected for manual annotation using an uncertainty and density measure; meanwhile, non-match samples are labeled automatically to reduce human annotation effort. We include a detailed experimental evaluation on real-world data that demonstrates the effectiveness of our algorithms.
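The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the thresholds, the margin-based uncertainty score, the cosine-similarity density, the product used to combine them, and all names (`select_queries`, `p_certainty`, `p_uncertainty`) are assumptions made for illustration. The two probabilities stand in for the outputs of the two feature-subspace classifiers.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_queries(pairs, pool_threshold=0.5, reject_threshold=0.1, k=2):
    """Pick k record pairs for manual labeling and auto-label confident non-matches.

    Each pair is a dict with hypothetical keys:
      'id', 'features', 'p_certainty', 'p_uncertainty'
    where the two probabilities are match probabilities from two classifiers
    trained on different feature subspaces (assumed, not specified in the abstract).
    """
    # Match pool: pairs the certainty classifier considers likely matches.
    pool = [p for p in pairs if p["p_certainty"] >= pool_threshold]
    # Assumption: pairs it confidently rejects are labeled non-match without a human.
    auto_non_matches = [p["id"] for p in pairs if p["p_certainty"] <= reject_threshold]

    scored = []
    for p in pool:
        # Uncertainty peaks at 1.0 when the second classifier outputs 0.5.
        uncertainty = 1.0 - abs(2.0 * p["p_uncertainty"] - 1.0)
        # Density: mean similarity to the other pool members (representativeness).
        others = [q for q in pool if q is not p]
        if others:
            density = sum(cosine(p["features"], q["features"]) for q in others) / len(others)
        else:
            density = 1.0
        scored.append((uncertainty * density, p["id"]))

    scored.sort(reverse=True)
    return [pid for _, pid in scored[:k]], auto_non_matches


# Toy candidate pairs; all values are made up for illustration.
PAIRS = [
    {"id": "a", "features": [1.0, 0.0], "p_certainty": 0.90, "p_uncertainty": 0.50},
    {"id": "b", "features": [1.0, 0.1], "p_certainty": 0.80, "p_uncertainty": 0.95},
    {"id": "c", "features": [0.9, 0.2], "p_certainty": 0.70, "p_uncertainty": 0.60},
    {"id": "d", "features": [0.0, 1.0], "p_certainty": 0.05, "p_uncertainty": 0.10},
    {"id": "e", "features": [0.5, 0.5], "p_certainty": 0.30, "p_uncertainty": 0.40},
]

queries, auto_non_matches = select_queries(PAIRS)
print(queries, auto_non_matches)
```

In this toy run, pair `b` is in the match pool but is skipped for annotation because the second classifier is already confident about it, while pair `e` falls between the two thresholds and is left unlabeled for a later round.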
