Using Natural Language Processing to Enable In-depth Analysis of Clinical Messages Posted to an Internet Mailing List: A Feasibility Study

Background: An Internet mailing list may be characterized as a virtual community of practice that serves as an information hub with easy access to expert advice and opportunities for social networking. We are interested in mining messages posted to a list for dental practitioners to identify clinical topics. Once we understand the topical domain, we can study dentists’ real information needs and the nature of their shared expertise, and can avoid delivering useless content at the point of care in future informatics applications. However, a necessary first step involves developing procedures to identify messages that are worth studying given our resources for planned, labor-intensive research.

Objectives: The primary objective of this study was to develop a workflow for finding a manageable number of clinically relevant messages from a much larger corpus of messages posted to an Internet mailing list, and to demonstrate the potential usefulness of our procedures for investigators by retrieving a set of messages tailored to the research question of a qualitative research team.

Methods: We mined 14,576 messages posted to an Internet mailing list from April 2008 to May 2009. The list has about 450 subscribers, mostly dentists from North America interested in clinical practice. After extensive preprocessing, we used the Natural Language Toolkit to identify clinical phrases and keywords in the messages. Two academic dentists classified collocated phrases in an iterative, consensus-based process to describe the topics discussed by dental practitioners who subscribe to the list. We then consulted with qualitative researchers regarding their research question to develop a plan for targeted retrieval. We used selected phrases and keywords as search strings to identify clinically relevant messages and delivered the messages in a reusable database.

Results: About half of the subscribers (245/450, 54.4%) posted messages. Natural language processing (NLP) yielded 279,193 clinically relevant tokens or processed words (19% of all tokens). Of these, 2.02% (5634 unique tokens) represent the vocabulary for dental practitioners. Based on pointwise mutual information score and clinical relevance, 325 collocated phrases (eg, fistula filled obturation and herpes zoster) with 108 keywords (eg, mercury) were classified into 13 broad categories with subcategories. In the demonstration, we identified 305 relevant messages (2.1% of all messages) over 10 selected categories with instances of collocated phrases, and 299 messages (2.1%) with instances of phrases or keywords for the category systemic disease.

Conclusions: A workflow with a sequence of machine-based steps and human classification of NLP-discovered phrases can support researchers who need to identify relevant messages in a much larger corpus. Discovered phrases and keywords are useful search strings to aid targeted retrieval. We demonstrate the potential value of our procedures for qualitative researchers by retrieving a manageable set of messages concerning systemic and oral disease.
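
The two machine-based steps named in the abstract, scoring collocated phrases by pointwise mutual information (PMI) and then using selected phrases as search strings, can be illustrated with a minimal sketch. It is not the authors' pipeline: the study used the Natural Language Toolkit and extensive preprocessing, whereas this sketch uses only the Python standard library. The toy messages, the stopword list, the `min_count` threshold, and the function names `tokenize`, `bigram_pmi`, and `retrieve` are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical toy messages; NOT taken from the study corpus.
MESSAGES = [
    "Patient presented with herpes zoster on the left side.",
    "Has anyone treated herpes zoster before a root canal?",
    "The root canal went fine, no mercury concerns.",
    "Mercury in amalgam came up again after the root canal.",
]

# Assumed, deliberately tiny stopword list (the paper's preprocessing is not specified here).
STOPWORDS = {"the", "a", "on", "with", "in", "has", "no", "before", "after", "up"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def bigram_pmi(messages, min_count=2):
    """Score adjacent word pairs by PMI = log2(P(x, y) / (P(x) * P(y))).

    Pairs seen fewer than `min_count` times are skipped, because PMI is
    unreliable for rare events.
    """
    unigrams, bigrams = Counter(), Counter()
    for msg in messages:
        tokens = tokenize(msg)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))  # pairs never cross messages
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x, p_y = unigrams[w1] / n_uni, unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


def retrieve(messages, phrases):
    """Return indices of messages that contain any selected phrase."""
    wanted = set(phrases)
    hits = []
    for i, msg in enumerate(messages):
        tokens = tokenize(msg)
        if wanted & set(zip(tokens, tokens[1:])):
            hits.append(i)
    return hits


if __name__ == "__main__":
    scores = bigram_pmi(MESSAGES)
    for phrase, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(" ".join(phrase), round(score, 2))
    # In the study, human experts chose which phrases to keep; here one is picked by hand.
    print(retrieve(MESSAGES, [("herpes", "zoster")]))
```

In the workflow described above, the ranked phrases would not be used directly: two academic dentists classified the candidates for clinical relevance before any were used as search strings, so the hand-picked phrase in the usage example stands in for that human step.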
