Predictive Analytics on Genomic Data with High-Performance Computing

Recent technological advancements and scientific discoveries have revolutionized genomics. Next-generation sequencing (NGS) technologies have tremendously reduced sequencing time and given rise to the production and collection of high volumes of genomic data. Predicting protein-coding genes from these copious genomic datasets is significant for the synthesis of proteins and the understanding of the regulatory function of non-coding regions. Methods have been developed to find protein-coding genes in the genomes of organisms. Nevertheless, the recent data explosion in genomics accentuates the need for more efficient gene prediction algorithms. In this paper, we explore predictive analytics on genomic data. In particular, we present a scalable naïve Bayes-based algorithm that is deployed on an Apache Spark cluster for efficient prediction of genes in the genomes of eukaryotic organisms. Evaluation results on human genome chromosomes from GRCh37 and GRCh38 show the effectiveness of our algorithm for predictive analytics on genomic data with high-performance computing, with high sensitivity, specificity and accuracy.
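To make the general idea concrete, the following is a minimal single-machine sketch of a naïve Bayes classifier that labels DNA segments as coding or non-coding, together with the three evaluation measures named above. It is not the algorithm of this paper and does not use Apache Spark. The k-mer count features, the Laplace smoothing constant, the toy sequences, and all names (`kmers`, `KmerNaiveBayes`, `evaluate`) are assumptions made for illustration. Specificity is computed here as TN / (TN + FP); some gene prediction studies instead report TP / (TP + FP) under the same name.

```python
import math
from collections import Counter


def kmers(seq, k=3):
    """Return the overlapping k-mers of a DNA sequence (assumed feature choice)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class KmerNaiveBayes:
    """Multinomial naive Bayes over k-mer counts with Laplace smoothing."""

    def __init__(self, k=3, alpha=1.0):
        self.k = k
        self.alpha = alpha

    def fit(self, sequences, labels):
        self.classes = sorted(set(labels))
        self.counts = {c: Counter() for c in self.classes}
        # Per-class k-mer counting is the step a cluster framework could distribute.
        for seq, label in zip(sequences, labels):
            self.counts[label].update(kmers(seq, self.k))
        self.vocab = set().union(*self.counts.values())
        class_sizes = Counter(labels)
        self.log_prior = {c: math.log(class_sizes[c] / len(labels)) for c in self.classes}
        self.totals = {c: sum(self.counts[c].values()) for c in self.classes}
        return self

    def log_score(self, seq, c):
        """Log prior plus the sum of smoothed log likelihoods of the k-mers."""
        denom = self.totals[c] + self.alpha * len(self.vocab)
        score = self.log_prior[c]
        for km in kmers(seq, self.k):
            if km in self.vocab:  # k-mers never seen in training are ignored
                score += math.log((self.counts[c][km] + self.alpha) / denom)
        return score

    def predict(self, seq):
        return max(self.classes, key=lambda c: self.log_score(seq, c))


def evaluate(y_true, y_pred, positive="coding"):
    """Return (sensitivity, specificity, accuracy) for a binary labelling."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return sensitivity, specificity, accuracy


# Toy training data, invented for illustration only.
train_seqs = ["ATGGCCATGGCC", "ATGGCCGCCATG", "TTTTAAAATTTT", "AAAATTTTAAAA"]
train_labels = ["coding", "coding", "noncoding", "noncoding"]
model = KmerNaiveBayes(k=3).fit(train_seqs, train_labels)

test_seqs = ["ATGGCCATG", "TTTTAAAA"]
test_labels = ["coding", "noncoding"]
predictions = [model.predict(s) for s in test_seqs]
print(predictions, evaluate(test_labels, predictions))
```

Scores are accumulated in log space so that long sequences do not underflow, and the additive smoothing keeps a k-mer that is absent from one class from driving that class's probability to zero.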
