Research Paper: Text Categorization Models for High-Quality Article Retrieval in Internal Medicine

OBJECTIVE: Finding the best scientific evidence that applies to a patient problem is becoming exceedingly difficult due to the exponential growth of medical publications. The objective of this study was to apply machine learning techniques to automatically identify high-quality, content-specific articles for one time period in internal medicine and to compare their performance with the earlier Boolean-based PubMed clinical query filters of Haynes et al.

DESIGN: The selection criteria of the ACP Journal Club for articles in internal medicine were the basis for identifying high-quality articles in the areas of etiology, prognosis, diagnosis, and treatment. Naive Bayes, a specialized AdaBoost algorithm, and linear and polynomial support vector machines were applied to identify these articles.

MEASUREMENTS: The machine learning models were compared in each category with each other and with the clinical query filters using area under the receiver operating characteristic curve, 11-point average recall precision, and a sensitivity/specificity match method.

RESULTS: In most categories, the data-induced models had sensitivity, specificity, and precision better than or comparable to those of the clinical query filters. Among all learning methods, the polynomial support vector machine models performed best at ranking the articles, as evaluated by area under the receiver operating characteristic curve and 11-point average recall precision.

CONCLUSION: Using inclusion or citation by the ACP Journal Club as a gold standard, machine learning methods can automatically build models for retrieving high-quality, content-specific articles in internal medicine for a given time period, and these models perform better than the 1994 PubMed clinical query filters.
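
The two ranking measures named under MEASUREMENTS can be illustrated with a minimal sketch. The abstract does not give the authors' exact computation, so the code below assumes the textbook definitions: area under the ROC curve via the pairwise rank (Mann-Whitney) identity, and 11-point average precision as the mean of interpolated precision at recall levels 0.0, 0.1, ..., 1.0. The function names and the toy scores are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve: the fraction of (positive, negative)
    pairs in which the positive article gets the higher score
    (ties count as half a win)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def eleven_point_average_precision(
    scores: Sequence[float], labels: Sequence[int]
) -> float:
    """Mean interpolated precision at recall levels 0.0, 0.1, ..., 1.0.
    Interpolated precision at level r is the highest precision observed
    at any rank whose recall is at least r."""
    # Rank articles from highest to lowest model score.
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda t: -t[0])]
    total_pos = sum(ranked)
    if total_pos == 0:
        raise ValueError("need at least one positive label")

    # (hits so far, precision) after each rank position.
    points = []
    hits = 0
    for k, y in enumerate(ranked, start=1):
        hits += y
        points.append((hits, hits / k))

    interpolated = []
    for i in range(11):  # recall level r = i / 10
        # Integer comparison hits/total_pos >= i/10 avoids float rounding.
        candidates = [prec for h, prec in points if h * 10 >= i * total_pos]
        interpolated.append(max(candidates) if candidates else 0.0)
    return sum(interpolated) / 11


if __name__ == "__main__":
    # Hypothetical model scores for four articles; label 1 = high quality.
    toy_scores = [0.9, 0.8, 0.7, 0.6]
    toy_labels = [1, 0, 1, 0]
    print(roc_auc(toy_scores, toy_labels))
    print(eleven_point_average_precision(toy_scores, toy_labels))
```

Both measures depend only on the order in which a model ranks the articles, not on a classification threshold, which is why they suit a comparison of ranking quality across learning methods.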
