Classification of Cancer Stage from Free-text Histology Reports

This article investigates the classification of a patient's lung cancer stage from their free-text medical reports. The system first uses natural language processing to transform the report text, including identification of UMLS terms and detection of negated findings. The transformed report is then classified using statistical machine learning techniques: a support vector machine is trained for each stage category based on word occurrences in a corpus of histology reports for pathologically staged patients. New reports can then be assigned the most likely stage, allowing population stage data to be collected for the analysis of outcomes. While the system could in principle be applied to stage other cancer types, the current work focuses on lung cancer due to data availability. The article presents initial experiments quantifying system performance for T and N staging on a corpus of histology reports from more than 700 lung cancer patients.
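The pipeline described above (negation-aware text transformation, then one support vector machine per stage category over word occurrences) can be illustrated with a minimal sketch. It is not the article's implementation: the cue list, the three-token negation window, the toy reports, and all function names are assumptions made for illustration, UMLS term identification is omitted, and the linear SVM is trained with a simple hinge-loss subgradient method rather than any particular SVM package.

```python
import re

# Hypothetical cue list and window size; real negation detectors use richer rules.
NEGATION_CUES = {"no", "not", "without"}
NEGATION_WINDOW = 3
BIAS = "__bias__"  # constant feature acting as the intercept


def extract_features(report):
    """Return binary word-occurrence features, prefixing negated words with NEG_."""
    tokens = re.findall(r"[a-z0-9]+|[.,;]", report.lower())
    features = {BIAS}
    remaining = 0  # how many upcoming words are still inside a negation scope
    for tok in tokens:
        if tok in {".", ",", ";"}:
            remaining = 0  # punctuation ends the negation scope
        elif tok in NEGATION_CUES:
            remaining = NEGATION_WINDOW
        elif remaining > 0:
            features.add("NEG_" + tok)
            remaining -= 1
        else:
            features.add(tok)
    return features


def score(weights, features):
    """Linear decision value; unseen words contribute zero."""
    return sum(weights.get(f, 0.0) for f in features)


def train_linear_svm(examples, epochs=20, eta=0.1, lam=0.01):
    """Subgradient descent on L2-regularised hinge loss; labels are +1 or -1."""
    weights = {}
    for _ in range(epochs):
        for features, y in examples:
            margin = y * score(weights, features)
            for f in weights:  # shrink step from the L2 penalty
                weights[f] *= 1.0 - eta * lam
            if margin < 1.0:  # hinge loss is active: move towards the example
                for f in features:
                    weights[f] = weights.get(f, 0.0) + eta * y
    return weights


def train_stage_classifiers(reports, stages):
    """Train one one-vs-rest linear SVM per stage category."""
    docs = [extract_features(r) for r in reports]
    return {
        stage: train_linear_svm(
            [(d, 1 if s == stage else -1) for d, s in zip(docs, stages)]
        )
        for stage in sorted(set(stages))
    }


def classify(classifiers, report):
    """Assign the stage whose classifier gives the highest decision value."""
    features = extract_features(report)
    return max(classifiers, key=lambda stage: score(classifiers[stage], features))


# Invented toy reports, for illustration only (not taken from the article's corpus).
TRAIN_REPORTS = [
    "Tumour 2 cm, no pleural invasion.",
    "Tumour 4 cm with visceral pleural invasion.",
    "Tumour 1 cm, no pleural invasion.",
    "Tumour 5 cm, pleural invasion present.",
]
TRAIN_STAGES = ["T1", "T2", "T1", "T2"]

classifiers = train_stage_classifiers(TRAIN_REPORTS, TRAIN_STAGES)
print(classify(classifiers, "Tumour 3 cm, no pleural invasion."))
print(classify(classifiers, "Tumour 3 cm with pleural invasion."))
```

In this toy setting the words "pleural" and "invasion" occur in reports of both stages, so the classifiers can only separate them because negated mentions are mapped to distinct features. That is the motivation for detecting negated findings before classification.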
