Combining Structured and Unstructured Information Sources for a Study of Data Quality: A Case Study of Zillow.com

Zillow is a leading web-based real-estate information service in the US. We studied user-contributed facts in a sample of Zillow records. User-contributed information appears to improve both the completeness and the level of detail of the information on Zillow.com. However, the accuracy of user-contributed facts may not be high. An investigation of the sources of error revealed several weaknesses, including conceptual challenges, information integration failures, and design deficiencies. The lack of a shared, user-friendly conceptual foundation was found to be a significant drawback. In part, the errors are a product of Zillow's wide geographic coverage and highly networked operation. In addition, important peculiarities of a property are often unknown to the public. Information about such peculiarities is typically shared by a small group of people whose levels of expertise and stakes in that property, and in real estate in general, may differ. This environment makes it difficult to harness collective intelligence. The results demonstrate the success of our evaluation strategy, which relies on a systematic review of a rich set of online sources. A similar strategy may also be useful for large-scale error detection and correction, provided that an efficient automated equivalent can be developed.
