Solving Large Scale Instances of the Distribution Design Problem Using Data Mining

This paper addresses the solution of large instances of the distribution design problem. Traditional approaches do not consider that instance size can significantly reduce the efficiency of the solution process. We propose a new approach that uses data mining techniques to compress the original instance, transforming it into a new, condensed one. The goal of the transformation is to condense the operation access pattern of the original instance, reducing the resources needed to solve it without significantly reducing the quality of its solution. To validate the approach, we tested two instance compression methods on a new model of the replicated version of the distribution design problem that incorporates generalized database objects. The experimental results show that our approach reduces the computational resources needed to solve large instances by at least 65%, without significantly reducing solution quality. Given these encouraging results, we are currently working on the design and implementation of efficient instance compression methods based on other data mining techniques.
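As a rough illustration of what condensing an operation access pattern can mean, the sketch below merges operations that touch exactly the same set of data objects and sums their per-site access frequencies. This is not one of the two compression methods evaluated in the paper, which the abstract does not describe; the data layout (`usage`, `freq`) and the function name `compress_instance` are assumptions made only for this example.

```python
def compress_instance(usage, freq):
    """Merge operations with identical object-usage patterns.

    usage[k]: tuple of 0/1 flags, one per data object, for operation k
              (hypothetical layout, assumed for this sketch).
    freq[k]:  list of access frequencies, one per site, for operation k.
    Returns the compressed (usage, freq) pair.
    """
    merged = {}  # usage pattern -> summed per-site frequencies
    for pattern, site_freqs in zip(usage, freq):
        key = tuple(pattern)
        if key in merged:
            # Same access pattern seen before: accumulate its frequencies.
            merged[key] = [a + b for a, b in zip(merged[key], site_freqs)]
        else:
            merged[key] = list(site_freqs)
    new_usage = list(merged.keys())
    new_freq = [merged[p] for p in new_usage]
    return new_usage, new_freq


# Five operations over three data objects and two sites.
usage = [(1, 0, 1), (0, 1, 0), (1, 0, 1), (0, 1, 0), (1, 1, 0)]
freq = [[5, 0], [2, 3], [1, 4], [0, 1], [7, 7]]

new_usage, new_freq = compress_instance(usage, freq)
print(len(usage), "operations compressed to", len(new_usage))
```

Merging exact duplicates keeps the total access frequency per pattern and site unchanged. A clustering or sampling method that also groups merely similar patterns would shrink the instance further, at some cost in solution quality.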
