SourceRank: relevance and trust assessment for deep web sources based on inter-source agreement

We consider the problem of deep web source selection and argue that existing source selection methods are inadequate because they rely on local similarity assessment. Specifically, they fail to account for the fact that sources vary in trustworthiness and individual results vary in importance. In response, we formulate a global measure of the relevance and trustworthiness of a source based on agreement between the answers provided by different sources. Agreement is modeled as a graph with sources at the vertices. On this agreement graph, source quality scores, which we call SourceRank, are calculated as the stationary visit probabilities of a weighted random walk. Our experiments on online databases and 675 book sources from Google Base show that SourceRank improves the relevance of results by 25-40% over existing methods and Google Base ranking. The SourceRank of a source also decreases linearly with its corruption level.
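The core computation described above, the stationary visit probability of a weighted random walk on the agreement graph, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `source_rank`, the hand-written edge weights, the uniform teleportation term and the `damping` value are all assumptions made for illustration. The abstract states only that the walk is weighted, and does not say here how agreement weights are derived from the sources' answers.

```python
def source_rank(agreement, damping=0.85, tol=1e-12, max_iter=1000):
    """Stationary distribution of a weighted random walk (power iteration).

    agreement[i][j] is a hypothetical non-negative weight on the edge from
    source i to source j, read as "how strongly j's answers agree with i's".
    The damping/teleportation term is an assumption that keeps the walk
    ergodic; it is not taken from the abstract.
    """
    n = len(agreement)

    # Row-normalise edge weights into transition probabilities.
    transition = []
    for row in agreement:
        total = sum(row)
        if total > 0:
            transition.append([w / total for w in row])
        else:
            # A source that agrees with nobody jumps uniformly at random.
            transition.append([1.0 / n] * n)

    rank = [1.0 / n] * n
    for _ in range(max_iter):
        new_rank = [
            (1.0 - damping) / n
            + damping * sum(rank[i] * transition[i][j] for i in range(n))
            for j in range(n)
        ]
        converged = sum(abs(a - b) for a, b in zip(new_rank, rank)) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Toy agreement graph: sources 0 and 1 agree strongly with each other,
# while few sources agree with source 2.
agreement = [
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.1],
    [0.5, 0.5, 0.0],
]
scores = source_rank(agreement)
```

In this toy graph the two mutually agreeing sources end up with equal scores, both higher than the score of the source that few others agree with, which is the behaviour the agreement-based measure is meant to capture.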
