A Study on Textual Features for Medical Records Classification

The healthcare domain is characterized by a huge amount of data contained in medical records, reports, test results, and so on. To support healthcare workers and manage relevant data effectively and efficiently, it is important to correctly classify the unstructured parts of text embedded in medical documents. In this paper, we propose a classification system for medical record categorization that combines different methodologies based on lexical, syntactic, and semantic analysis of the documents. We show that a classification system based on a combination of different text analysis methodologies outperforms each methodology taken alone. The results are presented in terms of Accuracy-Rejection Curves. Finally, the pros and cons of the proposed architecture are discussed and some directions for future work are pointed out.
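
To make the evaluation measure concrete, the following is a minimal sketch of how an Accuracy-Rejection Curve can be computed. It assumes that each classified document comes with a confidence score and a flag saying whether its predicted label was correct. The function name and its inputs are hypothetical and are used only for illustration; this is not the implementation used in this study.

```python
from typing import List, Sequence, Tuple


def accuracy_rejection_curve(
    confidences: Sequence[float],
    correct: Sequence[bool],
) -> List[Tuple[float, float]]:
    """Return (rejection_rate, accuracy) points (hypothetical helper).

    The k least confident documents are rejected, and accuracy is
    measured on the documents that remain, for k = 0 .. n - 1.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    # Most confident first, so the tail holds the first candidates for rejection.
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)
    points = []
    for rejected in range(n):  # always keep at least one document
        kept = order[: n - rejected]
        accuracy = sum(correct[i] for i in kept) / len(kept)
        points.append((rejected / n, accuracy))
    return points


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    scores = [0.9, 0.8, 0.6, 0.4]
    hits = [True, True, False, True]
    for rate, acc in accuracy_rejection_curve(scores, hits):
        print(f"rejection={rate:.2f} accuracy={acc:.2f}")
```

A classifier whose confidence scores are informative should yield a curve that tends to rise as the rejection rate grows, since the documents discarded first are the ones it is least sure about.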