Biomedical informatics with optimization and machine learning

Fast-growing biomedical and healthcare data span multiple scales, from molecules and individuals to populations, and connect the various entities in healthcare systems (providers, pharma, payers) with increasing bandwidth, depth, and resolution. These data are becoming an enabling resource for accelerating basic science discoveries and facilitating evidence-based clinical solutions. Although methods for extracting patterns from data have been around for centuries, it remains extremely difficult to transform massive data into valuable knowledge by these traditional means of analysis. This motivates the development of modern analytics methods, which are designed to discover meaningful representations or structures in data using optimization and machine learning. Broadly, optimization and machine-learning methods are commonly used in two types of applications in biomedical informatics. The first focuses on knowledge discovery: analyzing historical data to provide insight into what happened and why it happened. Statistical modeling, trend reporting, and visualization, as well as association and correlation analysis, are commonly used in this type of application. The second focuses on prediction and decision-making: a known dataset (the training dataset), which includes input features and response values, is used to build a predictive model, which is then applied to make predictions on unseen data (the test dataset). There is a consensus that the sheer volume and complexity of the data now easily acquired in biomedical informatics present major barriers to their translation into effective clinical actions.
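The prediction setting described above (fit a model on a training dataset of input features and response values, then apply it to unseen test data) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the synthetic two-class data, the nearest-centroid classifier, and the helper names `make_toy_data`, `train_test_split`, `fit_centroids`, `predict`, and `accuracy` do not come from any specific study or library.

```python
import random

def make_toy_data(n_per_class=50, seed=0):
    """Generate synthetic (features, label) rows; purely illustrative data."""
    rng = random.Random(seed)
    rows = []
    for label, center in ((0, (0.0, 0.0)), (1, (5.0, 5.0))):
        for _ in range(n_per_class):
            features = [rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0)]
            rows.append((features, label))
    return rows

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out a fraction as the test set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_centroids(train):
    """Learn one mean feature vector (centroid) per response value."""
    sums, counts = {}, {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, features):
    """Return the label whose centroid is nearest in squared Euclidean distance."""
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(features, centroids[label]))
    return min(centroids, key=sq_dist)

def accuracy(centroids, rows):
    """Fraction of rows whose predicted label matches the recorded label."""
    return sum(predict(centroids, f) == y for f, y in rows) / len(rows)

data = make_toy_data()
train, test = train_test_split(data)   # the model never sees `test` during fitting
model = fit_centroids(train)
test_accuracy = accuracy(model, test)  # estimate of performance on unseen data
```

Only the training rows are passed to `fit_centroids`, so `test_accuracy` reflects how the model behaves on data it has not seen, which is the quantity of interest in the prediction setting.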
There is thus a compelling demand for novel algorithms, including machine learning, data mining, and optimization, that specifically tackle the unique challenges of biomedical and healthcare data and allow decision-makers and stakeholders to better interpret and exploit the data. Recent years have witnessed major breakthroughs in machine learning when it is equipped with powerful optimization technologies. In general, biomedical data often feature large volumes, high dimensionality, imbalanced classes, heterogeneous sources, noise, incompleteness, and rich contexts. Such demanding features are also driving the development of numerical optimization algorithms in tandem with machine-learning algorithms. For example, progress in biomedical informatics is hindered by ubiquitous data challenges such as imbalanced datasets, weakly structured or unstructured data, and noisy or ambiguous labels. The optimization algorithms must also scale to the complexity of biomedical data, which are usually large-scale, high-dimensional, heterogeneous, and noisy. It is also of great interest to revisit traditional machine-learning topics such as clustering, classification, regression, and dimension reduction and turn them into powerful customized approaches for newly emerging biomedical informatics problems such as electronic medical record analysis and heterogeneous data fusion. Beyond the methodological issues, there is much to be learned from applying these methods in the real world, regarding how the context of an application informs the design, implementation, interpretation, and validation of these methods.
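As a small illustration of why imbalanced classes need special handling, the sketch below compares plain accuracy with balanced accuracy for a trivial classifier and computes inverse-frequency class weights. The 95/5 label split, the weighting rule n / (k * count), and the function names are assumptions chosen for illustration; they are one common heuristic, not a method prescribed by this article.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n / (k * count) so every class carries equal total weight."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}

def plain_accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly, regardless of class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare classes count as much as common ones."""
    hits, totals = Counter(), Counter(y_true)
    for t, p in zip(y_true, y_pred):
        if t == p:
            hits[t] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Hypothetical cohort: 95 negative cases and 5 positive cases.
labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(labels)

# A classifier that always predicts the majority class.
always_majority = [0] * len(labels)
acc = plain_accuracy(labels, always_majority)         # looks good, but is misleading
bal_acc = balanced_accuracy(labels, always_majority)  # exposes the missed minority class
```

Here the majority-only classifier reaches a plain accuracy of 0.95 yet a balanced accuracy of only 0.5, because it never identifies a positive case. Weights such as those in `weights` can be supplied to a weighted loss so that errors on the rare class are penalized more heavily during optimization.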
Challenging applications are present in many areas of biomedical informatics, such as Computational Biology, which includes the advanced interpretation of critical biological findings, using databases and cutting-edge computational infrastructure; Clinical Informatics, which includes the scenarios of using computation and data for health care, spanning medicine, dentistry, nursing, pharmacy, and allied health; Public Health Informatics, which includes the studies of patients and populations to improve the public health system and to elucidate epidemiology; mHealth Applications, which include the

* Correspondence: yshen@tamu.edu
Department of Electrical and Computer Engineering and TEES-AgriLife Center for Bioinformatics and Genomic Systems Engineering, Texas A&M University, College Station, TX 77843, USA
Full list of author information is available at the end of the article