Machine learning for epigenetics and future medical applications

ABSTRACT Understanding epigenetic processes holds immense promise for medical applications, and advances in Machine Learning (ML) are critical to realizing that promise. Previous studies combined novel ML approaches with epigenetic data sets associated with the germline transmission of epigenetic transgenerational inheritance of disease to predict the genome-wide locations of critical epimutations. A combination of Active Learning (ACL) and Imbalanced Class Learning (ICL) was used to address earlier problems with ML: it provides a more efficient feature selection process and handles the class imbalance problem present in all genomic data sets. The results suggest the power of this novel ML approach and its ability to predict epigenetic phenomena and associated disease. The current approach, however, requires extensive computation of features over the genome. A promising new approach is to introduce Deep Learning (DL) for the simultaneous generation and computation of novel genomic features tuned to the classification task. This approach can be used with any genomic or biological data set applied to medicine. This review focuses on the application of molecular epigenetic data, analyzed with advanced machine learning, to medicine.
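The combination of ACL and ICL described above can be illustrated with a minimal sketch. It is not the method used in the reviewed studies. It assumes uncertainty sampling as the active learning query strategy, inverse-frequency class weights as the imbalance correction, and a small logistic regression as the classifier. The two-dimensional synthetic "genomic features", the `oracle` labeling function, and all other names are hypothetical.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_weighted_logistic(X, y, epochs=200, lr=0.1):
    """Gradient-descent logistic regression with inverse-frequency class weights
    (assumed imbalance correction: the rarer class gets the larger weight)."""
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    w_pos = len(y) / (2.0 * max(n_pos, 1))
    w_neg = len(y) / (2.0 * max(n_neg, 1))
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = (w_pos if yi == 1 else w_neg) * (p - yi)
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(y) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(y)
    return w, b

def predict_proba(model, x):
    """Predicted probability that x belongs to the positive (rare) class."""
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def active_learning(pool_X, oracle, seed_idx, rounds=10):
    """Each round, label the pool item the current model is least sure about."""
    labels = {i: oracle(i) for i in seed_idx}
    labeled = list(seed_idx)
    for _ in range(rounds):
        model = fit_weighted_logistic([pool_X[i] for i in labeled],
                                      [labels[i] for i in labeled])
        unlabeled = [i for i in range(len(pool_X)) if i not in labels]
        if not unlabeled:
            break
        # Uncertainty sampling: predicted probability closest to 0.5.
        query = min(unlabeled,
                    key=lambda i: abs(predict_proba(model, pool_X[i]) - 0.5))
        labels[query] = oracle(query)  # hypothetical expert or assay label
        labeled.append(query)
    final = fit_weighted_logistic([pool_X[i] for i in labeled],
                                  [labels[i] for i in labeled])
    return final, labeled

# Synthetic, heavily imbalanced pool: 10 positive sites versus 190 negative sites.
random.seed(0)
positives = [[random.gauss(2.5, 0.5), random.gauss(2.5, 0.5)] for _ in range(10)]
negatives = [[random.gauss(0.0, 0.5), random.gauss(0.0, 0.5)] for _ in range(190)]
pool = positives + negatives
truth = [1] * 10 + [0] * 190

final_model, labeled = active_learning(pool, lambda i: truth[i],
                                       seed_idx=[0, 1, 10, 11], rounds=10)
print(len(labeled), "items labeled out of", len(pool))
```

In this sketch only the queried items are ever labeled, which is the point of active learning when labels are expensive, and the class weights keep the rare positive class from being ignored during training.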
