Support Vector Feature Selection for Early Detection of Anastomosis Leakage From Bag-of-Words in Electronic Health Records

The free text in electronic health records (EHRs) conveys a large amount of clinical information about patients' health state and history. Despite a rapidly growing literature on machine learning techniques for extracting this information, little effort has gone into feature selection and the medical interpretation of the selected features. In this study, we focus on early detection of anastomosis leakage (AL), a severe complication after elective surgery for colorectal cancer (CRC), using free text extracted from EHRs. We use a bag-of-words model to investigate the potential of feature selection strategies. The aim is to detect AL earlier and to predict it from data generated in the EHR before the complication occurs. Because the data are high-dimensional, we derive feature selection strategies from the robust linear maximum-margin support vector machine classifier, investigating: 1) a simple statistical criterion (leave-one-out-based test); 2) a computationally intensive statistical criterion (bootstrap resampling); and 3) an advanced statistical criterion (kernel entropy). Results reveal discriminatory power for early detection of complications after CRC surgery (sensitivity 100%; specificity 72%). These results can be used to develop prediction models, based on EHR data, that can support surgeons and patients in the preoperative decision-making phase.
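As a rough illustration of the second criterion (bootstrap resampling over linear SVM weights on a bag-of-words representation), the sketch below may help. It is not the authors' implementation: the toy notes, the function names `bag_of_words`, `train_linear_svm`, and `bootstrap_select`, the plain batch subgradient solver for the hinge loss, and the rule "keep a word if its 95% bootstrap percentile interval of the weight excludes zero" are all assumptions made for this example.

```python
import random

# Toy, invented notes (NOT data from the study); +1 = leakage, -1 = no leakage.
docs = [
    ("fever abdominal pain", 1),
    ("fever pain drain", 1),
    ("abdominal pain fever tachycardia", 1),
    ("patient mobilised well", -1),
    ("normal diet patient well", -1),
    ("patient well mobilised normal", -1),
]

# Vocabulary: every distinct word, in a fixed (sorted) order.
vocab = sorted({w for text, _ in docs for w in text.split()})

def bag_of_words(text):
    """Map a note to a vector of word counts over the vocabulary."""
    tokens = text.split()
    return [tokens.count(w) for w in vocab]

X = [bag_of_words(text) for text, _ in docs]
y = [label for _, label in docs]

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=100):
    """Linear SVM weights via batch subgradient descent on the
    L2-regularised hinge loss (a simple stand-in for a real SVM solver)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [lam * wj for wj in w], 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # sample violates the margin: hinge term is active
                for j in range(d):
                    gw[j] -= yi * xi[j] / n
                gb -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w

def bootstrap_select(X, y, n_boot=100, alpha=0.05, seed=0):
    """Keep words whose bootstrap percentile interval for the SVM weight
    excludes zero (assumed selection rule for this sketch)."""
    rng = random.Random(seed)
    n = len(X)
    samples = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        samples.append(train_linear_svm([X[i] for i in idx], [y[i] for i in idx]))
    k = int(alpha / 2 * n_boot)  # number of values trimmed from each tail
    selected = []
    for j, word in enumerate(vocab):
        ws = sorted(s[j] for s in samples)
        lo, hi = ws[k], ws[n_boot - 1 - k]
        if lo > 0 or hi < 0:
            selected.append(word)
    return selected

weights = train_linear_svm(X, y)
selected = bootstrap_select(X, y)
print(dict(zip(vocab, (round(wj, 3) for wj in weights))))
print("selected:", selected)
```

In this sketch, a positive weight marks a word associated with the leakage class and a negative weight a word associated with the other class. With a real corpus, one would likely swap the toy solver for an established SVM implementation and stratify the resampling so that every bootstrap sample contains both classes.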
