Predicting the quality of health web documents using their characteristics

Purpose: The quality of consumer-oriented health information on the web has been defined and evaluated in several studies. These evaluations usually rely on criteria chosen by the researchers and, so far, there is no agreed standard for which quality indicators to use. Tools based on such indicators have been developed to evaluate the quality of web information; the HONcode is one such tool. The purpose of this paper is to investigate how the features of web documents influence their quality, using HONcode as ground truth, in order to determine whether a document's quality can be predicted from its characteristics.

Design/methodology/approach: The present work uses a set of health documents and analyzes how their characteristics (e.g. web domain, last update, type, mention of places of treatment and prevention strategies) are associated with their quality. Based on these features, statistical models are built that predict whether health-related web documents have certification-level quality. Multivariate analysis is performed, using classification to estimate the probability that a document has quality given its characteristics. This approach also shows which predictors are important. Three types of logistic regression models are built and evaluated, each in a full and a reduced version. The first type includes every feature, without any exclusion; the second disregards the Utilization Review Accreditation Commission variable, because it is itself a quality indicator; and the third excludes the variables related to the HONcode principles, which might also be indicators of quality. The reduced models were built to see whether similar results can be reached with fewer features.

Findings: The prediction models have high accuracy, even when the characteristics related to the Health on the Net code (HONcode) principles are left out of the models. The most informative prediction model considers characteristics that can be assessed automatically (e.g. split content, type, process of revision and place of treatment). It has an accuracy of 89 percent.

Originality/value: This paper proposes models that automatically predict whether a document has quality or not. Some of the features used (e.g. prevention, prognosis or treatment) have not yet been explicitly considered in this context. The findings of the present study may be used by search engines to promote high-quality documents. This would improve health information retrieval and may help reduce the problems caused by inaccurate information.
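
The comparison of a full and a reduced logistic regression model described in the abstract can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration: the feature names, the synthetic documents, the invented labeling rule, and the plain gradient-descent fitting routine are hypothetical and are not the paper's data, variable coding, or software.

```python
import math
import random

# Hypothetical binary document features (illustrative names, not the paper's coding).
FEATURES = ["split_content", "mentions_revision_process", "mentions_place_of_treatment"]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_synthetic_documents(n, seed=0):
    """Generate toy documents; labels follow an invented rule plus noise."""
    rng = random.Random(seed)
    docs = []
    for _ in range(n):
        x = [rng.randint(0, 1) for _ in FEATURES]
        # Invented ground truth: only the first two features affect certification.
        p = sigmoid(-2.0 + 2.5 * x[0] + 2.0 * x[1])
        y = 1 if rng.random() < p else 0
        docs.append((x, y))
    return docs


def fit_logistic(docs, feature_idx, epochs=500, lr=0.5):
    """Fit weights and intercept by batch gradient descent on the mean log-loss."""
    w = [0.0] * len(feature_idx)
    b = 0.0
    n = len(docs)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in docs:
            xs = [x[i] for i in feature_idx]
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, xs))) - y
            for j, xj in enumerate(xs):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(model, x, feature_idx):
    """Estimated probability that a document has certification-level quality."""
    w, b = model
    return sigmoid(b + sum(wi * x[i] for wi, i in zip(w, feature_idx)))


def accuracy(model, docs, feature_idx):
    hits = sum((predict_proba(model, x, feature_idx) >= 0.5) == (y == 1) for x, y in docs)
    return hits / len(docs)


docs = make_synthetic_documents(400)
train, test = docs[:300], docs[300:]

full_idx = [0, 1, 2]   # full model: every feature
reduced_idx = [0, 1]   # reduced model: drops the last feature

full_model = fit_logistic(train, full_idx)
reduced_model = fit_logistic(train, reduced_idx)
full_acc = accuracy(full_model, test, full_idx)
reduced_acc = accuracy(reduced_model, test, reduced_idx)

print("full model weights:", dict(zip(FEATURES, full_model[0])))
print("full accuracy:", full_acc, "reduced accuracy:", reduced_acc)
```

The fitted weights indicate which predictors matter in the toy data, and the two held-out accuracies show the kind of comparison a reduced model is meant to support. The numbers printed here reflect only the synthetic data and say nothing about the 89 percent accuracy reported in the paper.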
