Assessing relevance and trust of the deep web sources and results based on inter-source agreement

Deep web search engines face the formidable challenge of retrieving high-quality results from a vast collection of searchable databases. Deep web search is a two-step process: selecting high-quality sources, then ranking the results from the selected sources. Existing methods address both steps, but they assess the relevance of sources and results using query-result similarity. Applied to the deep web, these methods have two deficiencies. First, they are agnostic to the correctness (trustworthiness) of the results. Second, query-based relevance does not consider the importance of the results and sources. Both considerations are essential for the deep web and for open collections in general. Since a number of deep web sources provide answers to any query, we conjecture that the agreement between these answers is helpful in assessing the importance and trustworthiness of the sources and the results. For assessing source quality, we compute the agreement between sources as the agreement of the answers they return. While computing the agreement, we also measure and compensate for possible collusion between sources. This adjusted agreement is modeled as a graph with sources as the vertices. On this agreement graph, the quality score of a source, which we call SourceRank, is calculated as the stationary visit probability of a random walk. For ranking results, we analyze the second-order agreement between the results. Extending SourceRank to multidomain search, we propose a source ranking that is sensitive to the query domains: multiple domain-specific rankings of a source are computed, and these ranks are combined for the final ranking. We perform extensive evaluations on online sources and on hundreds of Google Base sources spanning multiple domains. The proposed result and source rankings are implemented in the deep web search engine Factal. We demonstrate that the agreement analysis tracks source corruption. Further, our relevance evaluations show that our methods improve precision significantly over Google Base and the other baseline methods. The result ranking and the domain-specific source ranking are evaluated separately.
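
To make the random-walk step concrete, the following is a minimal sketch of how a stationary visit probability could be computed on a small weighted agreement graph. It is an illustration under stated assumptions, not the implementation evaluated here: the function name `stationary_scores`, the uniform smoothing controlled by `damping`, the use of power iteration, and the toy agreement weights are all hypothetical choices. The collusion adjustment is assumed to have already been applied to the weights.

```python
def stationary_scores(agreement, damping=0.85, iters=200, tol=1e-12):
    """Approximate the stationary visit probability of a random walk.

    agreement[i][j] is the (already collusion-adjusted) weight of the edge
    from source i to source j. Names and defaults here are illustrative
    assumptions, not values taken from the abstract.
    """
    n = len(agreement)

    # Row-normalize weights into transition probabilities. A source with no
    # outgoing agreement jumps to every source uniformly.
    transition = []
    for row in agreement:
        total = sum(row)
        if total > 0:
            transition.append([w / total for w in row])
        else:
            transition.append([1.0 / n] * n)

    scores = [1.0 / n] * n
    for _ in range(iters):
        # Uniform smoothing term (assumed) keeps the walk from getting stuck.
        nxt = [(1.0 - damping) / n] * n
        for i in range(n):
            for j in range(n):
                nxt[j] += damping * scores[i] * transition[i][j]
        delta = sum(abs(a - b) for a, b in zip(nxt, scores))
        scores = nxt
        if delta < tol:
            break
    return scores


# Toy graph: sources 0 and 1 agree strongly with each other, and little
# agreement weight flows toward source 2.
toy_agreement = [
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.1],
    [0.5, 0.5, 0.0],
]
toy_scores = stationary_scores(toy_agreement)
```

In this toy graph the two mutually agreeing sources end up with equal scores that are higher than the score of the third source, which matches the intuition that a source endorsed by the answers of other sources is visited more often by the walk.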
