On the Advantage of Using Dedicated Data Mining Techniques to Predict Colorectal Cancer

Electronic Medical Records (EMRs) provide a wealth of data that can be used to generate predictive models for diseases. A considerable number of studies have used EMRs to generate such models for specific diseases, but most of them rely on the more traditional techniques of the medical domain, such as logistic regression. This paper studies the benefit of using advanced data mining techniques to predict Colorectal Cancer (CRC). CRC is the second most common cancer in the EU and is known to be a disease with highly nonspecific predictors, which makes it difficult to generate good predictive models. In addition, EMR data poses its own challenges, including sparsity, differences in how physicians code the data, the temporal nature of the data, and class imbalance. Results show that state-of-the-art data mining techniques, including temporal data mining, are able to generate better predictive models than those currently available in the literature.
