Lifelong Machine Learning and root cause analysis for large-scale cancer patient data

Introduction

This paper presents a lifelong learning framework that continually adapts to changing data patterns over time through an incremental learning approach. In many big data systems, iteratively re-training on high-dimensional data from scratch is computationally infeasible, since constant stream ingestion on top of a historical data pool causes training time to grow steeply. The challenge, therefore, is to retain past learning and update the model quickly and incrementally as new data arrive. In addition, current machine learning approaches make predictions without providing a comprehensive root cause analysis. To address these limitations, our framework builds on an ensemble process that combines stream data with historical batch data to form an incremental lifelong machine learning (LML) model.

Case description

A cancer patient's pathological tests, such as blood, DNA, urine, or tissue analysis, provide a unique signature based on DNA combinations. Our analysis enables personalized, targeted medication aimed at achieving a therapeutic response. The model is evaluated on data from The National Cancer Institute's Genomic Data Commons unified data repository. The aim is to prescribe personalized medicine based on the thousands of genotype and phenotype parameters recorded for each patient.

Discussion and evaluation

The model uses a dimension reduction method to shorten training time in an online sliding-window setting. We identify the Gleason score as a determining factor for cancer likelihood and substantiate this claim with the Lilliefors and Kolmogorov–Smirnov tests. We present clustering and Random Decision Forest results, and compare the model's prediction accuracy with standard machine learning algorithms on both numeric and categorical fields.

Conclusion

We propose an ensemble framework of stream and batch data for incremental lifelong learning.
The framework first applies a stream clustering technique and then a Random Decision Forest (RDF) regressor/classifier to isolate anomalous patient data, and it provides reasoning through root cause analysis based on feature correlations, with the aim of improving the overall survival rate. While the stream clustering technique creates groups of patient profiles, the RDF drills down into each group for comparison and reasoning, yielding useful actionable insights. The proposed MALA architecture retains past learned knowledge, transfers it to future learning, and iteratively becomes more knowledgeable over time.
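The two-stage pipeline described above — incremental stream clustering to group patient profiles, followed by a Random Decision Forest per group, with a normality check on the Gleason score — can be sketched roughly as follows. This is a minimal illustration on synthetic data, using scikit-learn's MiniBatchKMeans with partial_fit as a stand-in for the paper's stream clustering and RandomForestClassifier for the RDF; the feature layout, the gleason column, and the risk label are hypothetical and not taken from the GDC repository.

```python
import numpy as np
from scipy import stats
from sklearn.cluster import MiniBatchKMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic patient records; "gleason" is a hypothetical stand-in column.
n = 1000
X = rng.normal(size=(n, 5))
gleason = rng.integers(6, 11, size=n)           # Gleason scores 6..10
y = (gleason >= 8).astype(int)                  # hypothetical risk label

# 1) Normality check on the Gleason score: Kolmogorov-Smirnov against a
#    normal with estimated mean/std (Lilliefors is the variant that
#    corrects the critical values for this estimation).
ks_stat, p_value = stats.kstest(gleason, "norm",
                                args=(gleason.mean(), gleason.std()))

# 2) Stream clustering: feed the data in sliding windows via partial_fit,
#    so past windows never need to be revisited (incremental learning).
clusterer = MiniBatchKMeans(n_clusters=3, random_state=0, n_init=3)
for start in range(0, n, 200):                  # 200-row windows
    clusterer.partial_fit(X[start:start + 200])
labels = clusterer.predict(X)

# 3) Per-cluster Random Decision Forest: drill down into each patient
#    group and read feature importances as a root-cause proxy.
features = np.column_stack([X, gleason])
for c in range(3):
    mask = labels == c
    if len(np.unique(y[mask])) < 2:
        continue                                # skip single-class clusters
    rdf = RandomForestClassifier(n_estimators=50, random_state=0)
    rdf.fit(features[mask], y[mask])
    top = int(np.argmax(rdf.feature_importances_))
    print(f"cluster {c}: top feature index {top}")
```

In a full lifelong-learning setting, step 2 would run continuously over the incoming stream while the per-cluster forests are refreshed only for clusters whose membership has changed, so the model preserves past knowledge instead of retraining from scratch.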
