A Random Forest Classifier for Prokaryotes Gene Prediction

Metagenomics is the study of metagenomes, the collective genomes of microbial communities. It describes them in terms of the composition, relationships, and activities of their microorganisms, and so broadens our knowledge of the fundamentals of life and of microbial diversity. One way to accomplish this is to analyze the genes contained in metagenomes. The process of identifying genes in DNA sequences is usually called gene prediction. This work presents a new gene predictor based on the Random Forest classifier. The proposed model obtains better classification results than state-of-the-art gene prediction tools widely used by the bioinformatics community. Random Forest gave more robust results, being 27% better than Prodigal and 20% better than FragGeneScan in terms of AUC on the independent test set. Feature engineering is revisited for the gene prediction problem, reinforcing the importance of careful evaluation when assembling a good feature set. K-mer counting features can be seen as the fundamental building blocks for developing robust gene predictors.
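
As an illustration of the k-mer counting features mentioned above, the following is a minimal sketch rather than the authors' implementation. The function name `kmer_features`, the choice of k = 3, the restriction to the ACGT alphabet, and the normalization to relative frequencies are all assumptions made for this example; the abstract does not specify them.

```python
from collections import Counter
from itertools import product


def kmer_features(sequence, k=3):
    """Return relative k-mer frequencies of a DNA sequence.

    Hypothetical helper for illustration: the vector has 4**k entries,
    ordered lexicographically over the alphabet A, C, G, T. Windows that
    contain any other symbol (for example N) are ignored.
    """
    sequence = sequence.upper()
    # Fixed feature order so every sequence maps to the same columns.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Normalize only over windows made of valid nucleotides.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


features = kmer_features("ATGATG", k=3)
print(len(features))
```

A fixed-length vector of this kind could then be passed, together with coding/non-coding labels, to any Random Forest implementation (for example scikit-learn's `RandomForestClassifier`); which features and parameters the paper actually used is not stated in the abstract.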
