Automatic Recognition of Text Difficulty from Consumers Health Information

The Internet is one of the major sources of health information. However, some studies show that the health information presented on health Web sites is difficult for many consumers to read. Readability formulas usually measure the difficulty of writing style rather than the difficulty of content. In order to recommend health information at an appropriate reading level to consumers, we investigate the feasibility of identifying the text difficulty of health information using machine learning methods. A support vector machine is used to classify consumer health information into two classes: easy to read, and written at a reading level for the general public. Three feature sets (surface linguistic features, word difficulty features, and unigrams) and their combinations are compared in terms of classification accuracy. Unigram features alone reach an accuracy of 80.71%, and the combination of all three feature sets is the most effective, with an accuracy of 84.06%. Both are significantly better than surface linguistic features, word difficulty features, and their combination.
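
To make the three feature sets concrete, the following is a minimal Python sketch of how such features might be extracted from a document before being passed to a classifier. It is an illustration under stated assumptions, not the feature definitions used in the study: the abstract does not list the individual features, so the specific measures (`avg_sentence_len`, `avg_word_len`, `hard_word_ratio`), the tiny `EASY_WORDS` list, and all function names are hypothetical. The support vector machine step itself is omitted.

```python
import re
from collections import Counter

# Hypothetical stand-in for an easy-word list; the abstract does not say
# which word list, if any, backs the word difficulty features.
EASY_WORDS = {"the", "a", "is", "of", "your", "you", "to", "and",
              "can", "help", "heart", "doctor", "pain"}

def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def split_sentences(text):
    """Split on sentence-final punctuation and drop empty pieces."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]

def surface_features(text):
    """Surface linguistic features (assumed): sentence and word length."""
    words = tokenize(text)
    sentences = split_sentences(text)
    return {
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
    }

def word_difficulty_features(text, easy_words=EASY_WORDS):
    """Word difficulty feature (assumed): share of words not on the easy list."""
    words = tokenize(text)
    hard = [w for w in words if w not in easy_words]
    return {"hard_word_ratio": len(hard) / max(len(words), 1)}

def unigram_features(text):
    """Unigram features: one count per distinct token, prefixed to avoid name clashes."""
    return {"uni=" + w: c for w, c in Counter(tokenize(text)).items()}

def combined_features(text):
    """Merge the three feature sets into a single feature dictionary."""
    feats = {}
    feats.update(surface_features(text))
    feats.update(word_difficulty_features(text))
    feats.update(unigram_features(text))
    return feats

# Two made-up example texts, one per assumed class.
easy = "Your heart is a pump. A doctor can help you."
hard = "Myocardial infarction necessitates immediate revascularization."
```

Calling `combined_features(easy)` and `combined_features(hard)` yields sparse dictionaries that a linear classifier could consume after vectorization; comparing feature sets, as the abstract describes, would amount to training on each extractor's output separately and on their unions.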
