A high throughput semantic concept frequency based approach for patient identification: a case study using type 2 diabetes mellitus clinical notes.

Current research on high-throughput identification of patients with a specific phenotype is in its infancy. There is an urgent need for a general, automatic approach to patient identification. OBJECTIVE: Using Mayo Clinic electronic clinical notes, we propose a novel method that combines natural language processing (NLP), machine learning, and ontology for automatic patient identification. We also investigate the benefit of incorporating existing SNOMED semantic knowledge into a patient identification task. METHODS: A support vector machine (SVM) algorithm was applied to SNOMED concept units extracted from the clinical notes of type 2 diabetes mellitus (T2DM) cases and controls. Precision, recall, and F-score were calculated to evaluate performance. RESULTS: The approach achieved an F-score above 0.950 for both groups when all identified concept units were used as features. Concept units of the semantic type Disease or Syndrome carried the most important information for patient identification. Our results also suggest that coarse-level concepts contain enough information to classify T2DM cases and controls.
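
The abstract does not give implementation details, so the following is only a minimal sketch of two steps it names: turning extracted concept units into a frequency feature vector, and computing precision, recall, and F-score for one group. The concept identifiers, function names, and labels are hypothetical placeholders rather than real SNOMED codes or the authors' code, and the SVM training step itself is omitted.

```python
from collections import Counter


def concept_frequencies(concept_ids, vocabulary):
    """Count how often each vocabulary concept occurs in one patient's notes.

    concept_ids: concept identifiers extracted from the notes (hypothetical IDs).
    vocabulary:  ordered list of concepts that defines the feature columns.
    """
    counts = Counter(concept_ids)
    return [counts[c] for c in vocabulary]


def precision_recall_f(y_true, y_pred, positive):
    """Precision, recall, and balanced F-score for the class named `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


if __name__ == "__main__":
    # Hypothetical concept placeholders, not real SNOMED identifiers.
    vocabulary = ["C_DIABETES", "C_INSULIN", "C_FRACTURE"]
    features = concept_frequencies(
        ["C_DIABETES", "C_INSULIN", "C_DIABETES"], vocabulary
    )
    print(features)

    # Toy labels; each group (case and control) is scored separately.
    y_true = ["case", "case", "control", "control"]
    y_pred = ["case", "control", "control", "control"]
    for group in ("case", "control"):
        print(group, precision_recall_f(y_true, y_pred, group))
```

Scoring each group as the positive class in turn mirrors the abstract's report of an F-score for both cases and controls; the feature vectors would be the input to whichever SVM implementation is used.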
