A balanced approach to health information evaluation: A vocabulary-based naïve Bayes classifier and readability formulas

Since millions seek health information online, it is vital for this information to be comprehensible. Most studies use readability formulas, which ignore vocabulary, and conclude that online health information is too difficult. We developed a vocabulary-based, naïve Bayes classifier to distinguish between three difficulty levels in text. It proved 98% accurate in a 250-document evaluation. We compared our classifier with readability formulas for 90 new documents with different origins and asked representative human evaluators, an expert and a consumer, to judge each document. Average readability grade levels for educational and commercial pages were 10th grade or higher, too difficult according to current literature. In contrast, the classifier showed that 70-90% of these pages were written at an intermediate, appropriate level, indicating that vocabulary usage is frequently appropriate in text considered too difficult by readability formula evaluations. The expert considered the pages more difficult for a consumer than the consumer did. © 2008 Wiley Periodicals, Inc.
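As an illustration of the general idea of classifying text difficulty from vocabulary, the following is a minimal sketch of a multinomial naïve Bayes classifier with Laplace smoothing. It is not the classifier evaluated in the study: the abstract does not specify the features, smoothing, or training corpus, so the class name `VocabularyNaiveBayes`, the `tokenize` helper, the three label names, and the toy sentences are all assumptions made up for this example.

```python
import math
from collections import Counter, defaultdict

PUNCTUATION = ".,;:!?()\"'"


def tokenize(text):
    """Lowercase whitespace tokenizer that strips surrounding punctuation."""
    tokens = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [t for t in tokens if t]


class VocabularyNaiveBayes:
    """Hypothetical multinomial naive Bayes over word counts (illustration only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing constant
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        # Words never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            total_words = sum(self.word_counts[label].values())
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.doc_counts[label] / total_docs)
            for t in tokens:
                count = self.word_counts[label][t]
                score += math.log(
                    (count + self.alpha) / (total_words + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data invented for this sketch; not taken from the study.
train_docs = [
    "Your heart pumps blood through your body.",
    "High blood pressure can harm your heart and kidneys.",
    "Hypertension induces left ventricular hypertrophy and nephropathy.",
]
train_labels = ["easy", "intermediate", "difficult"]

model = VocabularyNaiveBayes().fit(train_docs, train_labels)
print(model.predict("Hypertension and nephropathy"))
print(model.predict("Your heart pumps blood"))
```

With only three toy sentences the predictions merely show the mechanics: each level accumulates its own word counts, and a new text is assigned to the level whose vocabulary makes it most probable. Sentence length and syllable counts, which drive readability formulas, play no role in this sketch.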
