Computer-Assisted Update of a Consumer Health Vocabulary Through Mining of Social Network Data

Background: Consumer health vocabularies (CHVs) have been developed to aid consumer health informatics applications. This purpose is best served if the vocabulary evolves with consumers’ language.

Objective: Our objective was to create a computer-assisted update (CAU) system that works with live corpora to identify new candidate terms for inclusion in the open access and collaborative (OAC) CHV.

Methods: The CAU system consisted of three main parts: a Web crawler and an HTML parser; a candidate term filter that uses natural language processing tools, including term recognition methods; and a human review interface. In evaluation, the CAU system was applied to the health-related social network website PatientsLikeMe.com. The system’s utility was assessed by comparing the candidate term list it generated to a list of valid terms hand-extracted from the text of the crawled webpages.

Results: The CAU system identified 88,994 unique 1- to 7-grams (“n-grams” are n consecutive words within a sentence) in 300 crawled PatientsLikeMe.com webpages. The manual review of the crawled webpages identified 651 valid terms not yet included in the OAC CHV or the Unified Medical Language System (UMLS) Metathesaurus, a collection of vocabularies amalgamated to form an ontology of medical terms. This corresponds to 1 valid term per 136.7 candidate n-grams. The term filter selected 774 candidate terms, of which 237 were valid terms, that is, 1 valid term among every 3 or 4 candidates reviewed.

Conclusion: The CAU system is effective for generating a list of candidate terms for human review during CHV development.
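To make the n-gram definition and the review ratios reported in the Results concrete, the following is a minimal sketch, not the CAU system's actual implementation. The function names `sentence_ngrams` and `candidates_per_valid`, the regex-based sentence splitting, and the tokenization rule are all assumptions made for illustration; only the counts (88,994; 651; 774; 237) come from the abstract.

```python
import re

def sentence_ngrams(text, max_n=7):
    """Collect unique 1- to max_n-grams that never cross a sentence boundary.

    Sentence splitting and tokenization here are deliberately naive
    (hypothetical choices, not the tools used by the CAU system).
    """
    ngrams = set()
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"[a-z0-9'-]+", sentence.lower())
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                ngrams.add(" ".join(words[i:i + n]))
    return ngrams

def candidates_per_valid(candidates, valid):
    """Number of candidates a reviewer reads, on average, per valid term."""
    return candidates / valid

# Counts taken from the Results above.
unfiltered_ratio = candidates_per_valid(88994, 651)  # all n-grams vs. valid terms
filtered_ratio = candidates_per_valid(774, 237)      # filter output vs. valid terms

if __name__ == "__main__":
    print(sorted(sentence_ngrams("I have brain fog. It hurts.")))
    print(round(unfiltered_ratio, 1), round(filtered_ratio, 2))
```

Under these assumptions, the unfiltered ratio reproduces the reported 136.7 candidates per valid term, and the filtered ratio of roughly 3.27 matches "1 valid term among every 3 or 4 candidates."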
