The quality of the XML Web

We collect evidence to answer the following question: is the quality of the XML documents found on the Web sufficient to apply XML technologies such as XQuery, XPath and XSLT? XML collections from the Web have previously been studied statistically, but no detailed information about the quality of XML documents on the Web has been available to date. We address this shortcoming in this study. We gathered 180K XML documents from the Web. Their quality is surprisingly good: 85.4% are well-formed and 99.5% of all specified encodings are correct. Validity, however, needs serious attention. Only 25% of all files contain a reference to a DTD or XSD, and just one-third of those are actually valid. Well-formedness errors and validity errors are studied in detail. Our study is well documented and easily repeatable, and all data is publicly available [21], (Grijzenhout, 2010) [52]. This paves the way for a periodic quality assessment of the XML Web.
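As an illustration of the kind of well-formedness measurement summarized above, the following is a minimal sketch using only Python's standard library. It is not the tooling used in the study: the function names `is_well_formed` and `well_formed_share` are hypothetical, and the sketch assumes each document is already available as raw bytes.

```python
import xml.etree.ElementTree as ET


def is_well_formed(data: bytes) -> bool:
    """Return True if the bytes parse as a well-formed XML document.

    Hypothetical helper for illustration; it only checks
    well-formedness, not validity against a DTD or XSD.
    """
    try:
        ET.fromstring(data)
    except (ET.ParseError, LookupError, ValueError):
        # ParseError covers syntax errors; the other two are caught
        # defensively in case a declared encoding cannot be handled.
        return False
    return True


def well_formed_share(documents) -> float:
    """Percentage of documents in the collection that are well-formed."""
    docs = list(documents)
    if not docs:
        return 0.0
    ok = sum(1 for d in docs if is_well_formed(d))
    return 100.0 * ok / len(docs)
```

Applied to a crawled collection, `well_formed_share` would give a figure comparable in spirit to the percentage reported above, although a different parser may classify borderline documents differently.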

[1]  Dave J. Beckett.  30% Accessible - A Survey of the UK Wide Web , 1997, Comput. Networks.

[2]  Bing Liu,et al.  Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data , 2006, Data-Centric Systems and Applications.

[3]  Byron Choi,et al.  What are real DTDs like? , 2002, WebDB.

[4]  Andreas Heuer,et al.  Metrics for XML Document Collections , 2002, EDBT Workshops.

[5]  Boris Motik,et al.  OWL 2: The next step for OWL , 2008, J. Web Semant..

[6]  Maarten Marx,et al.  Advanced Information Access to Parliamentary Debates , 2009, J. Digit. Inf..

[7]  Yannis Papakonstantinou,et al.  DTD inference for views of XML data , 2000, PODS.

[8]  Mario F. Triola,et al.  Essentials of Statistics , 2001 .

[9]  Irena Holubová,et al.  Statistical Analysis of Real XML Data Collections , 2006, COMAD.

[10]  Thomas Schwentick,et al.  Inference of concise DTDs from XML data , 2006, VLDB.

[11]  Andreas Hotho,et al.  Semantic Web Mining: State of the art and future directions , 2006, J. Web Semant..

[12]  Irene Pollach,et al.  Environmental websites: an empirical investigation of functionality and accessibility , 2006 .

[13]  Frank Neven,et al.  Learning deterministic regular expressions for the inference of schemas from XML data , 2008, WWW.

[14]  Shan Chen,et al.  An Experimental Study on Validation Problems with Existing HTML Webpages , 2005, International Conference on Internet Computing.

[15]  Gerhard Weikum,et al.  YAGO: A Large Ontology from Wikipedia and WordNet , 2008, J. Web Semant..

[16]  Horst Treiblmaier,et al.  Environmental Web Sites: An Empirical Investigation of Functionality and Accessibility , 2006, Proceedings of the 39th Annual Hawaii International Conference on System Sciences (HICSS'06).

[17]  Alberto O. Mendelzon,et al.  Database techniques for the World-Wide Web: a survey , 1998, SGMD.

[18]  Thomas Schwentick,et al.  Expressiveness and complexity of XML Schema , 2006, TODS.

[19]  Georg Gottlob,et al.  The Lixto data extraction project: back and forth between theory and practice , 2004, PODS.

[20]  Thomas Schwentick,et al.  Expressiveness of XSDs: from practice to theory, there and back again , 2005, WWW '05.

[21]  Stuart E. Madnick,et al.  Overview and Framework for Data and Information Quality Research , 2009, JDIQ.

[22]  Frank Neven,et al.  DTDs versus XML schema: a practical study , 2004, WebDB '04.

[23]  Thomas Schwentick,et al.  Which XML Schemas Admit 1-Pass Preorder Typing? , 2005, ICDT.

[24]  Panagiotis G. Ipeirotis,et al.  Duplicate Record Detection: A Survey , 2007 .

[25]  Vincent Yun Shen,et al.  Transforming web pages to become standard-compliant through reverse engineering , 2006, W4A '06.

[26]  Serge Abiteboul,et al.  Queries and computation on the web , 1997, Theor. Comput. Sci..

[27]  Diane M. Strong,et al.  Beyond Accuracy: What Data Quality Means to Data Consumers , 1996, J. Manag. Inf. Syst..

[28]  Denilson Barbosa,et al.  Studying the XML Web: Gathering Statistics from an XML Sample , 2005, World Wide Web.

[29]  Tim Furche,et al.  OXPath , 2011, Proc. VLDB Endow..

[30]  Maarten Marx,et al.  The quality of the XML web , 2011, CIKM '11.

[31]  Erhard Rahm,et al.  Data Cleaning: Problems and Current Approaches , 2000, IEEE Data Eng. Bull..

[32]  Scott Dick,et al.  Prevalence and classification of web page defects , 2010, Online Inf. Rev..

[33]  Arnaud Sahuguet.  Everything You Ever Wanted to Know About DTDs, But Were Afraid to Ask , 2000, WebDB.

[34]  Frank Neven,et al.  Simplifying XML schema: single-type approximations of regular tree languages , 2010, J. Comput. Syst. Sci..

[35]  Eric van der Vlist,et al.  Relax NG , 2003 .

[36]  Sriram Raghavan,et al.  Crawling the Hidden Web , 2001, VLDB.

[37]  Jens Lehmann,et al.  DBpedia - A crystallization point for the Web of Data , 2009, J. Web Semant..