Paper information - GEMINI: a computationally-efficient search engine for large gene expression datasets


Background: Low-cost DNA sequencing allows organizations to accumulate massive amounts of genomic data and use that data to answer a diverse range of research questions. Presently, users must search for relevant genomic data using a keyword, accession number, or meta-data tag. However, in this search paradigm the form of the query (a text-based string) is mismatched with the form of the target (a genomic profile).

Results: To improve access to massive genomic data resources, we have developed a fast search engine, GEMINI, that uses a genomic profile as a query to search for similar genomic profiles. GEMINI implements a nearest-neighbor search algorithm using a vantage-point tree to store a database of n profiles and, in certain circumstances, achieves an $\mathcal{O}(\log n)$ expected query time in the limit. We tested GEMINI on breast and ovarian cancer gene expression data from The Cancer Genome Atlas project and show that, in practice, its query time on genomic data scales as the logarithm of the number of records. In a database with $10^5$ samples, GEMINI identifies the nearest neighbor in 0.05 sec, compared to a brute-force search time of 0.6 sec.

Conclusions: GEMINI is a fast search engine that uses a query genomic profile to search for similar profiles in a very large genomic database. It enables users to identify similar profiles independent of sample label, data origin, or other meta-data information.
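To make the nearest-neighbor idea in the abstract concrete, the following is a minimal sketch of a vantage-point tree in Python. It is not GEMINI's implementation. The Euclidean metric, the random choice of vantage point, the median split, and every name (`VPNode`, `build_vp_tree`, `nearest`) are assumptions made for illustration. Each node stores one profile and a radius; a query prunes a subtree whenever the triangle inequality shows that the subtree cannot contain a closer profile than the best one found so far.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length profiles (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class VPNode:
    """One tree node: a vantage point, a split radius, and two subtrees."""

    def __init__(self, point, radius, inside, outside):
        self.point = point      # the vantage profile stored at this node
        self.radius = radius    # median distance from the vantage point
        self.inside = inside    # profiles at distance <= radius
        self.outside = outside  # profiles at distance >= radius


def build_vp_tree(points, rng):
    """Recursively build a balanced tree by splitting at the median distance."""
    points = list(points)
    if not points:
        return None
    vantage = points.pop(rng.randrange(len(points)))
    if not points:
        return VPNode(vantage, 0.0, None, None)
    ranked = sorted(((euclidean(vantage, p), p) for p in points),
                    key=lambda pair: pair[0])
    mid = len(ranked) // 2
    radius = ranked[mid][0]
    inside = [p for _, p in ranked[:mid]]
    outside = [p for _, p in ranked[mid:]]
    return VPNode(vantage, radius,
                  build_vp_tree(inside, rng), build_vp_tree(outside, rng))


def nearest(root, query):
    """Return (closest_profile, distance), pruning with the triangle inequality."""
    best = [math.inf, None]  # best distance so far and the matching profile

    def visit(node):
        if node is None:
            return
        d = euclidean(query, node.point)
        if d < best[0]:
            best[0], best[1] = d, node.point
        if d < node.radius:
            # Query lies inside the ball: search inside first.
            visit(node.inside)
            if d + best[0] >= node.radius:
                visit(node.outside)
        else:
            # Query lies outside the ball: search outside first.
            visit(node.outside)
            if d - best[0] <= node.radius:
                visit(node.inside)

    visit(root)
    return best[1], best[0]


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-ins for expression profiles (hypothetical data).
    profiles = [tuple(rng.random() for _ in range(5)) for _ in range(1000)]
    tree = build_vp_tree(profiles, rng)
    query = tuple(rng.random() for _ in range(5))
    match, dist = nearest(tree, query)
    print(match, dist)
```

Pruning only skips subtrees that provably hold no closer profile, so the result matches a brute-force scan. How many subtrees are actually skipped depends on the data and the metric, which is consistent with the abstract's statement that the logarithmic query time holds only in certain circumstances.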
