Diagnoses Detection in Short Snippets of Narrative Medical Texts

Abstract. Extracting data from narrative medical texts is an important step toward enabling secondary use of medical data. Supervised learning algorithms show good results in natural language processing (NLP) tasks. We have developed an NLP framework based on supervised machine learning for entity extraction from medical texts. The framework is independent of both language and entity type, provided that an appropriately labeled dataset is available. It is based on a vector representation of words and uses a neural network as the classifier. We trained and evaluated the framework on two different text corpora: diagnosis paragraphs written in German and medical records written in Russian. The neural network hyperparameters were adjusted for each dataset to improve the classification results. Finally, accuracy, standard deviation, and standard error were calculated for both network models using 10-fold cross-validation. The obtained accuracy is 97.64% for the Russian texts and 96.81% for the German ones.
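The evaluation protocol named in the abstract (10-fold cross-validation summarized by mean accuracy, standard deviation, and standard error) can be illustrated with a minimal sketch. This is not the authors' implementation. The helper names `k_fold_indices`, `summarize`, and `cross_validate`, as well as the `train_and_score` callback that stands in for training and scoring the neural network, are hypothetical. The sketch also assumes the sample standard deviation (n - 1 denominator), which the abstract does not specify.

```python
import math
import random
from statistics import mean, stdev


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding by k assigns every index to exactly one fold.
    return [indices[i::k] for i in range(k)]


def summarize(fold_accuracies):
    """Return (mean accuracy, standard deviation, standard error)."""
    k = len(fold_accuracies)
    avg = mean(fold_accuracies)
    sd = stdev(fold_accuracies)  # assumption: sample SD (n - 1 denominator)
    se = sd / math.sqrt(k)       # standard error of the mean over k folds
    return avg, sd, se


def cross_validate(samples, labels, train_and_score, k=10, seed=0):
    """Run k-fold cross-validation.

    train_and_score is a hypothetical callback that takes
    (train_x, train_y, test_x, test_y) and returns an accuracy in [0, 1].
    """
    scores = []
    for held_out in k_fold_indices(len(samples), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(samples)) if i not in held]
        scores.append(train_and_score(
            [samples[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [samples[i] for i in held_out],
            [labels[i] for i in held_out],
        ))
    return summarize(scores)
```

Each fold serves as the test set exactly once while the remaining nine folds are used for training, so the reported standard error reflects the variation of accuracy across the ten held-out folds.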