NOVEL APPLICATIONS OF MACHINE LEARNING IN BIOINFORMATICS

Technological advances in next-generation sequencing and biomedical imaging have rapidly increased both the dimensionality and the acquisition rate of biomedical data, which challenges conventional data analysis strategies. Modern machine learning techniques promise to leverage large data sets to find hidden patterns within them and to make accurate predictions. This dissertation aims to design novel machine learning-based models that transform biomedical big data into valuable biological insights. The research presented here focuses on three bioinformatics domains: splice junction classification, gene regulatory network reconstruction, and lesion detection in mammograms. Accurately identifying splice junctions is a critical step in defining gene structures and mRNA transcript variants. In the first work, we built the first deep learning-based splice junction classifier, DeepSplice. It outperforms state-of-the-art classification tools in both classification accuracy and computational efficiency. In the second work, to uncover transcription factors governing metabolic reprogramming in non-small-cell lung cancer patients, we developed TFmeta, a machine learning approach that reconstructs relationships between transcription factors and their target genes. Our approach achieves the best performance on benchmark data sets. In the third work, we designed deep learning-based architectures to perform lesion detection in both 2D and 3D whole mammogram images.
