DCUBE: CUBE on Dirty Databases

In real-world databases, dirty data such as inconsistent and duplicate records reduce the effectiveness of database applications. Dirty data also pose new challenges for processing OLAP efficiently. CUBE is an important OLAP operator. This paper proposes a CUBE operation based on overlapping clustering, together with an effective and efficient method for storing and computing CUBE on databases with dirty data. Based on this CUBE, the paper proposes efficient algorithms for answering aggregation queries, as well as processing methods for the other major OLAP operators on databases with dirty data. Experimental results demonstrate the efficiency of the proposed algorithms.
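For readers unfamiliar with the operator, the following is a minimal sketch of the standard CUBE on clean data: it aggregates a measure over every subset of the grouping dimensions. It is not the overlapping-clustering method proposed in the paper, which the abstract does not specify. The names `cube`, `ALL`, and the `sales` rows are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"  # hypothetical placeholder marking a rolled-up dimension


def cube(rows, dims, measure):
    """Sum `measure` for every subset of `dims` (2^n group-bys in total)."""
    result = defaultdict(float)
    for r in range(len(dims) + 1):
        for kept in combinations(dims, r):
            for row in rows:
                # Keep the value of each retained dimension, roll up the rest.
                key = tuple(row[d] if d in kept else ALL for d in dims)
                result[key] += row[measure]
    return dict(result)


# Illustrative data only; not taken from the paper.
sales = [
    {"city": "Boston", "year": 2009, "amount": 10.0},
    {"city": "Boston", "year": 2010, "amount": 5.0},
    {"city": "Austin", "year": 2009, "amount": 7.0},
]
cells = cube(sales, ["city", "year"], "amount")
```

Each key in `cells` is one cube cell; for example, `("Boston", "ALL")` holds the total for Boston across all years. On dirty data, duplicate or inconsistent rows would distort these sums, which is the problem the paper addresses.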
