Data Quality in Web Information Systems

The World Wide Web has brought a wave of revolutionary changes in how people and organizations generate, disseminate and use data. With unprecedented access to massive amounts of data and powerful information-gathering capabilities enabled by Web-based technologies, the traditional closed world assumption for database systems has been challenged. More and more data from the Web are used today as essential information sources, directly or indirectly, for all types of decision making, not only in personal but also in many business and scientific applications. A user of such Web data, however, has to constantly rely on their own judgement of data quality, such as correctness, currency, consistency and completeness. This is an unreliable and often very difficult process, as the quality of this judgement itself often relies on the quality of other information obtained from the Web, and the relationships among the data used can be very complex and sometimes hidden from the user. While the issue of data quality is as old as data itself, it is now exposed at a much higher, broader and more critical level due to the scale, diversity and ubiquity of Web Information Systems. The intrinsic mismatch between the intended use and the actual use of data on the Web is a fundamental cause of poor data quality in Web-based applications. In this talk, we will introduce the notion of data quality, from its roots in management information systems research to new issues and challenges in the context of large-scale Web Information Systems. After a brief introduction to organizational and architectural solutions to the data quality problem, this talk will focus on current research activities and results on computational solutions from the database community in data profiling, record linking, conditional functional constraints, data provenance and data uncertainty.
These technical solutions will be examined for their promise and limitations with respect to the problem of data quality in Web Information Systems. Finally, we will discuss a list of open research problems.