Large-Scale Machine Learning Algorithms for Biomedical Data Science

Data science is accelerating the translation of biological and biomedical data into advances in the detection, diagnosis, treatment, and prevention of diseases. However, the unprecedented scale and complexity of biomedical data have created critical computational bottlenecks that require new concepts and enabling tools. To address these challenges, we proposed several novel large-scale machine learning models for problems including multi-dimensional data integration, heterogeneous multi-task learning, and longitudinal feature learning. To handle the associated big-data computations, we also proposed new asynchronous distributed stochastic gradient and coordinate descent methods that efficiently solve convex and non-convex problems, and we parallelized deep learning optimization algorithms using layer-wise model parallelism. We applied these models to analyze multi-modal, longitudinal Electronic Medical Records (EMR) for predicting readmission of heart failure patients and drug side effects, to integrate neuroimaging and genome-wide array data for identifying phenotypic and genotypic biomarkers, and to detect histopathological image markers and multi-dimensional cancer genomic biomarkers in precision medicine studies.