Machine Learning Models for Error Detection in Metagenomics and Polyploid Sequencing Data

Metagenomics studies, as well as genomics studies of polyploid species such as wheat, analyze high-variation data. Such data contain sequences from similar but distinct genetic chains, which complicates analysis. In particular, instrumentation errors introduced during the digitalization of the sequences can be hard to detect, because they may be indistinguishable from the real biological variation in the data. This can prevent the determination of the correct sequences and also makes variant studies significantly more difficult. This paper details a collection of ML-based models used to distinguish a real variant from an erroneous one. The focus is on using these models directly, but experiments are also carried out in combination with other predictors that isolate a pool of error candidates.
