Efficient Transformation of Protein Sequence Databases to Columnar Index Schema

Mass spectrometry is used to sequence proteins and to extract biomarkers of biological environments. These biomarkers can be used to diagnose thousands of diseases and to optimize biological environments such as biogas plants. Indexing the protein sequence data makes it possible to streamline experiments and speed up the analysis. In this work, we present a schema for distributed column-based database management systems that uses a column-oriented index to store sequence data. This raises the problem of how to transform protein sequence data from the standard format into the new schema. We analyze and evaluate four transformation methods. The results show that our proposed extended radix tree performs best in terms of both memory consumption and computation time. Hence, the radix tree proves to be a suitable data structure for transforming protein sequences into the indexed schema.
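To make the role of the radix tree more concrete, the following is a minimal sketch of a plain (not extended) radix tree that collects sequences together with the proteins they occur in and then emits them as sorted rows. It is not the authors' implementation: the abstract does not describe the extension, the target schema, or the input parsing, so the names `Node`, `insert`, `lookup`, and `rows`, the protein identifiers, and the `(sequence, protein ids)` row layout are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Maps the first character of an edge label to (edge label, child node).
    children: dict = field(default_factory=dict)
    # Hypothetical payload: ids of the proteins that contain this sequence.
    protein_ids: set = field(default_factory=set)


def common_prefix_len(a: str, b: str) -> int:
    """Return the length of the longest common prefix of a and b."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def insert(root: Node, sequence: str, protein_id: str) -> None:
    """Insert a sequence, splitting an edge where it diverges from the tree."""
    node, rest = root, sequence
    while rest:
        entry = node.children.get(rest[0])
        if entry is None:
            # No edge starts with this character: attach the remainder as a leaf.
            leaf = Node()
            leaf.protein_ids.add(protein_id)
            node.children[rest[0]] = (rest, leaf)
            return
        label, child = entry
        k = common_prefix_len(label, rest)
        if k < len(label):
            # Partial match: split the edge at position k.
            mid = Node()
            mid.children[label[k]] = (label[k:], child)
            node.children[rest[0]] = (label[:k], mid)
            child = mid
        node, rest = child, rest[k:]
    node.protein_ids.add(protein_id)


def lookup(root: Node, sequence: str) -> set:
    """Return the protein ids stored for an exact sequence (empty set if absent)."""
    node, rest = root, sequence
    while rest:
        entry = node.children.get(rest[0])
        if entry is None:
            return set()
        label, child = entry
        if not rest.startswith(label):
            return set()
        node, rest = child, rest[len(label):]
    return set(node.protein_ids)


def rows(node: Node, prefix: str = ""):
    """Yield (sequence, sorted protein ids) pairs in lexicographic order."""
    if node.protein_ids:
        yield prefix, sorted(node.protein_ids)
    for first in sorted(node.children):
        label, child = node.children[first]
        yield from rows(child, prefix + label)


# Illustrative data only; the sequences and ids are made up.
root = Node()
insert(root, "PEPTIDE", "P1")
insert(root, "PEPTIDE", "P2")
insert(root, "PEPSIN", "P3")
insert(root, "PEP", "P4")
index_rows = list(rows(root))
```

In this sketch, shared prefixes are stored once and duplicate sequences collapse into a single node, which is one plausible reason a radix tree can keep memory consumption low during such a transformation. The sorted traversal in `rows` produces output that could be bulk-loaded into an ordered index, though the actual schema and loading step are outside what the abstract specifies.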
