Secret sequence comparison in distributed computing environments by interval sampling

Once a new gene has been sequenced, it must be checked for similarity to previously sequenced genes. In many cases, the organization that sequenced a potentially novel gene needs to keep the sequence itself confidential. However, to compare it with known sequences, the organization must either send the sequence as a query to public databases or download those databases onto a local computer. Sending the query exposes the sequence to the public, and downloading entire databases is costly. In this work, we propose a new method, called interval sampling, to compare sequences without leaking exact information about the new sequence. To keep the exact sequence secret, the method samples intervals (subsequences) from the sequence and hashes them. Only the hashed data are made public, and they are used to verify the novelty of the sequence. We find that this method parallelizes well in a distributed computing environment such as the Grid. Experimental results for 19,797 H. sapiens genes and 25,000 M. musculus genes show that the parallel implementation performs reasonably well in terms of speed and memory usage.
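The idea of sampling intervals and publishing only their hashes can be illustrated with a minimal Python sketch. It is not the implementation evaluated in this work: the interval length, the sampling step, the choice of SHA-256, and the function names `sample_intervals`, `hash_intervals`, and `shared_fraction` are all assumptions made for illustration. The sketch further assumes that the database side hashes every interval (step 1) while the query side publishes hashes of only a sampled subset.

```python
import hashlib


def sample_intervals(seq, length=8, step=4):
    """Yield subsequences of `length` bases, taken every `step` bases.

    Both parameters are illustrative assumptions, not values from the paper.
    """
    for start in range(0, len(seq) - length + 1, step):
        yield seq[start:start + length]


def hash_intervals(seq, length=8, step=4):
    """Return the set of SHA-256 digests of the sampled intervals.

    Only these digests would be published; the sequence itself stays private.
    """
    return {
        hashlib.sha256(sub.encode("ascii")).hexdigest()
        for sub in sample_intervals(seq, length, step)
    }


def shared_fraction(query_hashes, known_hashes):
    """Fraction of the query's interval hashes that also occur in a known sequence."""
    if not query_hashes:
        return 0.0
    return len(query_hashes & known_hashes) / len(query_hashes)


# Assumed setup: the known sequence is hashed at every offset (step=1),
# while the secret query is hashed only at sampled offsets.
known = "ACGTTAGCCGATAGGCTTAACGGATCCATGCA"
known_hashes = hash_intervals(known, length=8, step=1)

overlapping_query = known[3:27]      # a fragment of the known sequence
unrelated_query = "T" * 16           # shares no 8-base interval with it

overlap_score = shared_fraction(hash_intervals(overlapping_query), known_hashes)
novel_score = shared_fraction(hash_intervals(unrelated_query), known_hashes)
```

A high shared fraction suggests the query resembles a known sequence, and a fraction near zero suggests novelty. Because each known sequence can be scored independently, the comparison could be spread across the nodes of a distributed environment. Exact hash matching tolerates no mismatches inside an interval, so a practical scheme would need to choose the interval length with that in mind.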
