PiQA: an algebra for querying protein data sets

Life science researchers frequently need to query large protein data sets in a variety of ways. Proteins have a rich structure that includes a primary structure, described as a sequence of amino acids, and a secondary structure, described as a sequence of folding patterns. Both structures are important: the amino acid sequence is often used to find homologous proteins, and the secondary structure can provide important hints about the functionality of proteins. While there are tools for querying each of these structures independently, there are no tools for declarative querying on both structures together. Even the tools that allow querying on one of these structures are not based on any formal algebra and, as a result, require complex rewriting of the tool's programming logic when the "query evaluation plan" changes. This paper introduces PiQA, a Protein Query Algebra, which provides a rich set of algebraic operations on both the primary and secondary structure of proteins. Using PiQA, one can pose complex queries involving both the primary and the secondary structure of proteins. In addition, simple existing tools that query only on the primary structure, such as BLAST, can also be expressed in this algebra. PiQA is an important first step toward an algebra that can form the basis of a declarative language for querying protein data sets.
