The Effect of Feature Representation on MEDLINE Document Classification

This work explores how text representation affects the performance of medical text classification. To this end, we developed a text classification system that supports a basic word representation (bag-of-words) and a more complex medical phrase representation (bag-of-phrases). We also combined the word and phrase representations (hybrid) for further analysis. Our system extracts medical phrases from text by combining a medical knowledge base with natural language processing techniques. We evaluated the representations by measuring the change in classification performance on MEDLINE documents from the OHSUMED dataset, using three information retrieval metrics: precision (p), recall (r), and F1-score (F1). In our experiments, the hybrid approach achieved better classification performance (p=0.87, r=0.46, F1=0.60) than the bag-of-words approach (p=0.85, r=0.44, F1=0.58) and the bag-of-phrases approach (p=0.87, r=0.42, F1=0.57).
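
As an illustration only, the three representations and the F1-score can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the system described above: the phrase set `PHRASES` is a hypothetical stand-in for a medical knowledge base, phrase matching is a plain n-gram lookup rather than a full natural language processing pipeline, and all function names are chosen for this example. The F1-score is taken to be the standard harmonic mean of precision and recall, F1 = 2pr / (p + r), which is consistent with the rounded values reported above.

```python
from collections import Counter

# Hypothetical phrase lexicon; a real system would consult a medical
# knowledge base instead of this hard-coded set.
PHRASES = {("heart", "attack"), ("blood", "pressure")}


def bag_of_words(tokens):
    """Count each individual token."""
    return Counter(tokens)


def bag_of_phrases(tokens, phrases=PHRASES, max_len=3):
    """Count every n-gram (2 <= n <= max_len) found in the phrase lexicon."""
    found = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if gram in phrases:
                found["_".join(gram)] += 1
    return found


def hybrid(tokens, phrases=PHRASES):
    """Union of word features and phrase features."""
    return bag_of_words(tokens) + bag_of_phrases(tokens, phrases)


def f1_score(p, r):
    """Harmonic mean of precision p and recall r."""
    if p + r == 0:
        return 0.0
    return 2 * p * r / (p + r)


tokens = "high blood pressure after heart attack".split()
features = hybrid(tokens)

# Recompute F1 from the precision and recall reported for each approach.
for name, p, r in [("hybrid", 0.87, 0.46),
                   ("bag-of-words", 0.85, 0.44),
                   ("bag-of-phrases", 0.87, 0.42)]:
    print(f"{name}: F1 = {f1_score(p, r):.2f}")
```

In this sketch the hybrid feature set for the example sentence contains the six single words plus the two phrase features `blood_pressure` and `heart_attack`, so a classifier sees both the individual terms and the multi-word medical concepts.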