Computing Iceberg Queries E ciently

Many applications compute aggregate functions over an attribute (or set of attributes) to nd aggregate values above some speci ed threshold. We call such queries iceberg queries, because the number of abovethreshold results is often very small (the tip of an iceberg), relative to the large amount of input data (the iceberg). Such iceberg queries are common in many applications, including data warehousing, information-retrieval, market basket analysis in data mining, clustering and copy detection. We propose e cient algorithms to evaluate iceberg queries using very little memory and signi cantly fewer passes over data, when compared to current techniques that use sorting or hashing. We present an experimental case study using over three gigabytes of Web data to illustrate the savings obtained by our algorithms.

[1]  George Kingsley Zipf,et al.  Human behavior and the principle of least effort , 1949 .

[2]  David J. DeWitt,et al.  Duplicate record elimination in large data files , 1983, TODS.

[3]  Philippe Flajolet,et al.  Probabilistic Counting Algorithms for Data Base Applications , 1985, J. Comput. Syst. Sci..

[4]  Gerard Salton,et al.  Term-Weighting Approaches in Automatic Text Retrieval , 1988, Inf. Process. Manag..

[5]  Jeffrey D. Uuman Principles of database and knowledge- base systems , 1989 .

[6]  Kyu-Young Whang,et al.  A linear-time probabilistic counting algorithm for database applications , 1990, TODS.

[7]  Frank Olken,et al.  Random Sampling from Databases , 1993 .

[8]  Ramakrishnan Srikant,et al.  Fast Algorithms for Mining Association Rules in Large Databases , 1994, VLDB.

[9]  Hector Garcia-Molina,et al.  SCAM: A Copy Detection Mechanism for Digital Documents , 1995, DL.

[10]  T. J. Watson,et al.  An E ective Hash-Based Algorithm for Mining Association RulesJong , 1995 .

[11]  Jeffrey F. Naughton,et al.  Selectivity and Cost Estimation for Joins Based on Random Sampling , 1996, J. Comput. Syst. Sci..

[12]  Hector Garcia-Molina,et al.  Building a scalable and accurate copy detection mechanism , 1996, DL '96.

[13]  Rajeev Motwani,et al.  Dynamic itemset counting and implication rules for market basket data , 1997, SIGMOD '97.

[14]  Helen J. Wang,et al.  Online aggregation , 1997, SIGMOD '97.

[15]  Rajeev Motwani,et al.  Beyond market baskets: generalizing association rules to correlations , 1997, SIGMOD '97.

[16]  Andrei Z. Broder,et al.  On the resemblance and containment of documents , 1997, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171).

[17]  Geoffrey Zweig,et al.  Syntactic Clustering of the Web , 1997, Comput. Networks.

[18]  Hector Garcia-Molina,et al.  Finding Near-Replicas of Documents and Servers on the Web , 1998, WebDB.