An entropy-reducing data representation approach for bioinformatic data

Abstract

Non-semantic approaches to bioinformatic data analysis are potentially relevant where semantic resources, such as annotated finished reference genomes, are lacking. This is the case for the growing volume of sequence data from non-model organisms, often generated by sequence-based agricultural, aquacultural and environmental sampling studies and by commercial services. Even where rich semantic resources are available, semantic approaches to problems such as comparing and contrasting reference assemblies, or using multiple references in parallel to avoid reference bias, are costly and difficult to automate fully. We introduce and discuss non-semantic labelling, a non-semantic data representation approach intended mainly for bioinformatic data. Non-semantic labelling tensorially combines multiple kinds of model-based, entropy-reducing data representation with multiple representation models, so as to map both data and models into dual metric representation spaces. The goals are to reduce the statistical complexity of the data and to highlight latent structure through machine learning and statistical analyses conducted within the dual representation spaces. As part of the framework, we introduce a novel algebraic abstraction of data representation mappings and present four proof-of-concept examples of its application to problems such as comparing and contrasting sequence assemblies, using multiple references for annotation, and developing quality control diagnostics in a variety of high-throughput sequencing contexts.

Database URL: https://github.com/AgResearch/data_prism
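To make the idea of mapping both data and models into paired metric spaces more concrete, the following is a minimal illustrative sketch. It is not the authors' implementation and does not use the data_prism code. It assumes, purely for illustration, that each "representation model" is a many-to-one relabelling of nucleotides (purine/pyrimidine, strong/weak, amino/keto), which cannot increase Shannon entropy, and that each sequence is summarised by its entropy under each model. The names `MODELS`, `relabel`, `entropy_table` and `pairwise_distances` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical representation models: each is a many-to-one relabelling
# of nucleotides, so it can only keep or reduce the entropy of a sequence.
MODELS = {
    "purine_pyrimidine": {"A": "R", "G": "R", "C": "Y", "T": "Y"},
    "strong_weak": {"G": "S", "C": "S", "A": "W", "T": "W"},
    "amino_keto": {"A": "M", "C": "M", "G": "K", "T": "K"},
}


def shannon_entropy(symbols):
    """Shannon entropy (bits per symbol) of the empirical symbol distribution."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def relabel(seq, model):
    """Apply one representation model to a sequence."""
    return "".join(model[base] for base in seq)


def entropy_table(seqs, models):
    """Data-by-model table: entry [i][j] is the entropy of sequence i under model j."""
    return [
        [shannon_entropy(relabel(seq, model)) for model in models.values()]
        for seq in seqs
    ]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise_distances(points):
    """Symmetric matrix of Euclidean distances between the given points."""
    return [[euclidean(p, q) for q in points] for p in points]


if __name__ == "__main__":
    seqs = ["ACGT", "AAAA", "GCGC"]  # toy data, assumed for illustration
    table = entropy_table(seqs, MODELS)

    # Rows place each sequence in a space indexed by models ...
    data_distances = pairwise_distances(table)
    # ... and columns place each model in a space indexed by sequences.
    model_distances = pairwise_distances([list(col) for col in zip(*table)])

    for seq, row in zip(seqs, table):
        print(seq, "raw entropy:", shannon_entropy(seq), "relabelled:", row)
    print("distances between sequences:", data_distances)
    print("distances between models:", model_distances)
```

In this sketch the same table is read two ways: its rows are coordinates for the data and its columns are coordinates for the models, which is one simple reading of the "dual" spaces described above. The approach described in the abstract combines several kinds of representation tensorially, which this toy example does not attempt.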
