A selectivity model for fragmented relations: applied in information retrieval

New application domains cause today's database sizes to grow rapidly, placing great demands on technology. Data fragmentation facilitates techniques (such as distribution, parallelization, and main-memory computing) that meet these demands. Fragmentation may also make query types such as top-N more efficient to process. Database design and query optimization require a good notion of the costs resulting from a given fragmentation, and our mathematically derived selectivity model provides this. Its two parameters are computed from the fragmentation and need to be recomputed only after an update, which is usually infrequent; the model can then discard the data distribution, which yields fast and reasonably accurate selectivity estimation. We verify the model experimentally on Zipfian-distributed IR databases.
