Paper information - Crowd-sourcing Web knowledge for metadata extraction

Crowd-sourcing Web knowledge for metadata extraction

We explore a new metadata extraction framework that requires no human annotators; the ground truth is instead harvested from the Web. Each new training sample is selected based not only on its uncertainty and representativeness in the unlabeled pool, but also on its availability and credibility in Web knowledge bases. We construct a dataset of 4329 books with valid metadata and evaluate our approach using 5 Web book databases as oracles. Empirical results demonstrate the effectiveness and efficiency of the approach.
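The selection criterion described in the abstract can be illustrated with a small sketch. The abstract does not give the actual scoring function, so everything below is an assumption made for illustration: the `Candidate` fields, the [0, 1] scaling of each factor, and the choice to combine the four factors by a simple product are hypothetical and are not the authors' formulation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """An unlabeled sample with four hypothetical scores, each in [0, 1]."""
    name: str
    uncertainty: float         # how unsure the current model is about this sample
    representativeness: float  # how typical the sample is of the unlabeled pool
    availability: float        # estimated chance a Web source has a record for it
    credibility: float         # estimated reliability of the record that would be returned


def selection_score(c: Candidate) -> float:
    # Assumption: a plain product, so a sample scores high only when it is
    # informative AND a trustworthy label can likely be harvested for it.
    return c.uncertainty * c.representativeness * c.availability * c.credibility


def select_next(pool: List[Candidate]) -> Candidate:
    """Return the candidate with the highest combined score."""
    return max(pool, key=selection_score)


pool = [
    Candidate("book_a", uncertainty=0.9, representativeness=0.8, availability=0.1, credibility=0.9),
    Candidate("book_b", uncertainty=0.6, representativeness=0.7, availability=0.9, credibility=0.8),
    Candidate("book_c", uncertainty=0.3, representativeness=0.9, availability=1.0, credibility=1.0),
]

best = select_next(pool)
print(best.name)
```

In this toy pool, `book_a` is the most uncertain sample but is unlikely to be found in any Web source, so a purely uncertainty-driven strategy would spend a query on a label that probably cannot be retrieved. Weighting by availability and credibility shifts the choice to a sample whose label can actually be harvested.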

Wenyi Huang | Zhaohui Wu | C. Lee Giles | Chen Liang
