Ontology-based data quality framework for data stream applications

Data Stream Management Systems (DSMS) have been proposed to address the challenges of applications that produce continuous, rapid streams of data that have to be processed in real time. Data quality (DQ) plays an important role in DSMS, as there is usually a trade-off between accuracy and consistency on the one hand and timeliness and completeness on the other. Previous work on data quality in DSMS has focused only on specific aspects of DQ. In this paper, we present a flexible, holistic, ontology-based data quality framework for data stream applications. Our DQ model is based on a threefold notion of DQ. First, content-based evaluation of DQ uses semantic rules, which can be user-defined in an extensible ontology. Second, query-based evaluation adds DQ information to the query results and updates it while queries are being processed. Third, application-based evaluation can use any kind of function that computes an application-specific DQ value. The whole DQ process is driven by the metadata managed in an ontology, which provides a semantically clear definition of the DQ features of the DSMS. The evaluation of our approach in two case studies in the domain of traffic information systems has shown that our framework provides the required flexibility, extensibility, and performance for DQ management in DSMS.
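The threefold notion of DQ described above can be illustrated with a minimal Python sketch. It is not the framework's implementation: the `StreamTuple` class, the `RULES` dictionary standing in for the ontology, the speed range of 0 to 250, and the three function names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Values = Dict[str, Optional[float]]


@dataclass
class StreamTuple:
    """A stream element: payload values plus DQ annotations (hypothetical)."""
    values: Values
    dq: Dict[str, float] = field(default_factory=dict)


# Assumption: a plain dict stands in for the ontology. It maps a DQ
# dimension to a user-defined semantic rule over the tuple's values.
RULES: Dict[str, Callable[[Values], bool]] = {
    # Illustrative rule: a speed reading must lie in a plausible range.
    "consistency": lambda v: v["speed"] is None or 0.0 <= v["speed"] <= 250.0,
}


def content_based(t: StreamTuple) -> StreamTuple:
    """Content-based evaluation: apply each semantic rule to the tuple."""
    for dimension, rule in RULES.items():
        t.dq[dimension] = 1.0 if rule(t.values) else 0.0
    return t


def query_based_avg(window: List[StreamTuple], attr: str) -> StreamTuple:
    """Query-based evaluation: a window average that also updates DQ."""
    present = [t.values[attr] for t in window if t.values[attr] is not None]
    avg = sum(present) / len(present) if present else None
    out = StreamTuple({attr: avg})
    # Completeness: share of window tuples that carried a value.
    out.dq["completeness"] = len(present) / len(window) if window else 0.0
    # Consistency: pessimistic propagation, the worst input value wins.
    out.dq["consistency"] = min(
        (t.dq.get("consistency", 1.0) for t in window), default=1.0
    )
    return out


def application_based(
    t: StreamTuple, fn: Callable[[StreamTuple], float], name: str = "app_quality"
) -> StreamTuple:
    """Application-based evaluation: any function yielding a DQ value."""
    t.dq[name] = fn(t)
    return t


# Usage: four speed readings, one missing and one out of range.
window = [content_based(StreamTuple({"speed": s})) for s in (50.0, None, 300.0, 70.0)]
result = query_based_avg(window, "speed")
result = application_based(
    result, lambda t: t.dq["completeness"] * t.dq["consistency"]
)
```

In this sketch the rule set is the only place where DQ semantics are defined, which loosely mirrors the idea of driving the DQ process from metadata. The propagation choices (minimum for consistency, value share for completeness) are assumptions, not the definitions used in the paper.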
