Paper Information - High Performance GRID Based Implementation for Genomics and Protein Analysis

Starting from genomic and proteomic sequence data, a complex computational infrastructure has been established with the objective of developing a GRID-based system to automate the analysis, prediction and annotation of genomic DNA. To support this type of analysis, several algorithms have been used to recognize the biological signals involved in the identification of genes and proteins. The implemented system can be used to analyse the content of a large number of genomic sequences. For this reason, it is able to use a computational architecture specifically designed for intensive computing, based on GRID technologies developed throughout the BIOINFOGRID European project. We developed a GRID-based workflow to correlate different kinds of bioinformatics data, going from the genomic nucleotide sequence to the protein sequence. The first step in the workflow consists of submitting a nucleotide sequence, which is processed by dedicated gene prediction software. In particular, this tool searches the nucleotide sequence for the key components of a gene. The predicted gene is then translated into the corresponding protein sequence. From the protein sequence it is then possible to identify the domains that characterize the protein's functionality, using dedicated domain prediction tools. Protein domain classification is very important in the analysis of macromolecular functionality. Analyzing a whole protein family from the large genomes of various organisms means processing a large amount of data, which requires huge computational resources. To analyze all these data we suggest the use of a high performance platform based on grid technology. We have implemented our applications on a wide area grid platform for scientific applications [http://www.grid.it and http://grid-it.cnaf.infn.it] composed of about 1000 CPUs.
The grid infrastructure consists of a collection of computing elements and storage elements that together define a platform for high performance processing. In this study a grid-based application is presented that computes the protein domain analysis in a distributed way. This approach achieves high performance because the protein domains are checked with different software tools in parallel at different grid sites.
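
The workflow described above (predicted gene, translation to protein, domain checks run in parallel) can be illustrated with a minimal Python sketch. It is not the authors' implementation and makes several assumptions: the input is an already predicted coding sequence, translation uses the standard genetic code, the two "scanners" are hypothetical toy regular-expression motifs standing in for real domain prediction tools, and a local thread pool stands in for independent grid sites. All function and scanner names are invented for illustration.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(cds):
    """Translate a coding DNA sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def make_motif_scanner(name, pattern):
    """Build a toy scanner (hypothetical stand-in for a domain predictor)."""
    regex = re.compile(pattern)

    def scan(protein):
        # Report each match as (scanner name, start, end).
        return [(name, m.start(), m.end()) for m in regex.finditer(protein)]

    return scan


# Hypothetical scanners; a real workflow would call external tools instead.
SCANNERS = [
    make_motif_scanner("toy_motif_1", r"[AG].{4}GK[ST]"),
    make_motif_scanner("toy_motif_2", r"C.{2}C"),
]


def annotate(cds, scanners=SCANNERS):
    """Translate a predicted gene, then run all scanners concurrently."""
    protein = translate(cds)
    # Each scanner runs in its own worker, mimicking one tool per grid site.
    with ThreadPoolExecutor(max_workers=len(scanners)) as pool:
        results = pool.map(lambda scan: scan(protein), scanners)
        hits = [hit for result in results for hit in result]
    return protein, hits


if __name__ == "__main__":
    example_cds = "ATG" + "GCT" + "GGT" * 5 + "AAA" + "TCT" + "TGA"
    print(annotate(example_cds))
```

The scanners are independent of one another, so the same pattern extends to submitting one job per tool and per protein to separate computing elements, which is the source of the parallel speed-up the abstract refers to.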

Ivan Merelli | Luciano Milanesi