Result selection and summarization for Web Table search

The amount of information available on the Web has been growing dramatically, raising the importance of techniques for searching the Web. Recently, Web Tables have emerged as a model that enables users to search for information in a structured way. However, effective presentation of results for Web Table search requires (1) selecting a ranking of tables that acknowledges the diversity within the search result; and (2) summarizing the information content of the selected tables concisely but meaningfully. In this paper, we formalize these requirements as the diversified table selection problem and the structured table summarization problem. We show that both problems are computationally intractable and, thus, present heuristic algorithms to solve them. For these algorithms, we prove salient performance guarantees, such as near-optimality, stability, and fairness. Our experiments with real-world collections of thousands of Web Tables highlight the scalability of our techniques. We achieve improvements of up to 50% in diversity and 10% in relevance over baselines for Web Table selection, and reduce the information loss induced by table summarization by up to 50%. In a user study, we observed that our techniques were preferred over alternative solutions.
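
To make the first requirement concrete, the following is a minimal sketch of a generic greedy selection that trades relevance against diversity. It is not the algorithm proposed in the paper: the function names, the Jaccard distance over column labels, the `trade_off` parameter, and the toy data are all assumptions chosen for illustration.

```python
from typing import Dict, FrozenSet, List


def jaccard_distance(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Distance between two tables, measured on their column labels (assumed measure)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def greedy_diverse_selection(
    relevance: Dict[str, float],
    columns: Dict[str, FrozenSet[str]],
    k: int,
    trade_off: float = 0.5,
) -> List[str]:
    """Pick up to k tables, one at a time, balancing relevance and novelty.

    trade_off = 1.0 ranks purely by relevance; lower values reward tables
    that differ from those already selected.
    """
    selected: List[str] = []
    candidates = set(relevance)

    def score(table: str) -> float:
        if selected:
            # Novelty is the distance to the closest already-selected table.
            novelty = min(
                jaccard_distance(columns[table], columns[s]) for s in selected
            )
        else:
            novelty = 1.0
        return trade_off * relevance[table] + (1.0 - trade_off) * novelty

    while candidates and len(selected) < k:
        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical toy input: t1 and t2 are near-duplicates, t3 covers a new attribute.
relevance = {"t1": 0.9, "t2": 0.85, "t3": 0.6}
columns = {
    "t1": frozenset({"country", "capital"}),
    "t2": frozenset({"country", "capital"}),
    "t3": frozenset({"country", "population"}),
}

print(greedy_diverse_selection(relevance, columns, k=2))
print(greedy_diverse_selection(relevance, columns, k=2, trade_off=1.0))
```

With the default trade-off, the sketch skips the near-duplicate `t2` in favor of the less relevant but more novel `t3`; with `trade_off=1.0` it falls back to a plain relevance ranking.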
