Addressing Instance Ambiguity in Web Harvesting

Web Harvesting enriches incomplete data sets by retrieving the missing information from the Web. However, instance ambiguity can greatly reduce the quality of the harvested data, because any instance in the local data set may match several different entities when we attempt to identify it on the Web. Although many disambiguation methods have been proposed for ambiguity problems in various settings, none of them can handle the instance ambiguity problem in Web Harvesting. In this paper, we propose to perform instance disambiguation in Web Harvesting with a novel method inspired by the idea of collaborative identity recognition. In particular, we look for common properties, in the form of latent attribute values shared among the instances in the list, such that these shared attribute values distinguish the instances in the list from the ambiguous ones on the Web. Our extensive experimental evaluation illustrates the utility of collaborative disambiguation for a popular Web Harvesting application, and shows that it substantially improves the accuracy of the harvested data.
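To make the idea of shared attribute values concrete, the following is a minimal sketch and not the method evaluated in the paper. It assumes that each instance in the local list already has a set of candidate Web entities, each described by attribute-value pairs. The function names (`shared_value_support`, `disambiguate`), the scoring rule, and the toy data are all hypothetical and introduced only for illustration.

```python
from collections import Counter


def shared_value_support(candidates_by_instance):
    """Count, for each (attribute, value) pair, how many instances in the
    list have at least one candidate entity carrying that pair."""
    support = Counter()
    for candidates in candidates_by_instance.values():
        pairs_for_instance = set()
        for cand in candidates:
            pairs_for_instance.update(cand["attrs"].items())
        # Each instance contributes at most once per pair.
        support.update(pairs_for_instance)
    return support


def disambiguate(candidates_by_instance):
    """For each instance, pick the candidate whose attribute values are
    most widely shared across the list (an assumed, simplistic score)."""
    support = shared_value_support(candidates_by_instance)
    n = len(candidates_by_instance)
    chosen = {}
    for name, candidates in candidates_by_instance.items():
        def score(cand):
            return sum(support[pair] / n for pair in cand["attrs"].items())
        chosen[name] = max(candidates, key=score)["id"]
    return chosen


# Toy, made-up data: three names from a local list and their Web candidates.
toy_candidates = {
    "M. Jordan": [
        {"id": "jordan_athlete", "attrs": {"occupation": "athlete"}},
        {"id": "jordan_researcher",
         "attrs": {"occupation": "researcher", "field": "machine learning"}},
    ],
    "A. Smith": [
        {"id": "smith_researcher",
         "attrs": {"occupation": "researcher", "field": "machine learning"}},
        {"id": "smith_economist", "attrs": {"occupation": "economist"}},
    ],
    "J. Lee": [
        {"id": "lee_researcher",
         "attrs": {"occupation": "researcher", "field": "databases"}},
    ],
}

result = disambiguate(toy_candidates)
print(result)
```

In this toy list the value `occupation = researcher` is supported by all three instances, so the researcher candidates are preferred over the athlete and the economist. Note that this simple sum favors candidates with more attributes; a realistic scorer would need to normalize for that and to discover the shared values when they are latent rather than given.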
