This work explores how text representation techniques affect the performance of medical text classification. To this end, we developed a text classification system that supports a basic word representation (bag-of-words) and a more complex medical phrase representation (bag-of-phrases). We also combined the word and phrase representations (hybrid) for further analysis. Our system extracts medical phrases from text by combining a medical knowledge base with natural language processing techniques. We evaluated the representations by measuring the change in classification performance on MEDLINE documents from the OHSUMED dataset, using three information retrieval metrics: precision (p), recall (r), and F1-score (F1). In our experiments, the hybrid approach (p=0.87, r=0.46, F1=0.60) achieved better classification performance than the bag-of-words approach (p=0.85, r=0.44, F1=0.58) and the bag-of-phrases approach (p=0.87, r=0.42, F1=0.57).
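As a quick consistency check on the reported figures, the F1-score can be recomputed as the harmonic mean of precision and recall. The sketch below is illustrative only: the function name `f1_score` and the dictionary layout are assumptions made for this example and are not part of the system described here; only the precision and recall values come from the text.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract (hypothetical layout).
reported = {
    "hybrid": (0.87, 0.46),
    "bag-of-words": (0.85, 0.44),
    "bag-of-phrases": (0.87, 0.42),
}

# Recompute F1 for each representation, rounded to two decimals.
f1_by_representation = {
    name: round(f1_score(p, r), 2) for name, (p, r) in reported.items()
}

for name, f1 in f1_by_representation.items():
    print(f"{name}: F1={f1:.2f}")
```

Rounded to two decimals, the recomputed values agree with the reported F1-scores (0.60, 0.58, and 0.57). Precision is nearly identical across the three representations, so the differences in F1 come mainly from recall.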