GEMINI: a computationally-efficient search engine for large gene expression datasets

Background: Low-cost DNA sequencing allows organizations to accumulate massive amounts of genomic data and use that data to answer a diverse range of research questions. Presently, users must search for relevant genomic data using a keyword, accession number, or meta-data tag. However, in this search paradigm the form of the query (a text-based string) is mismatched with the form of the target (a genomic profile).

Results: To improve access to massive genomic data resources, we have developed a fast search engine, GEMINI, that uses a genomic profile as a query to search for similar genomic profiles. GEMINI implements a nearest-neighbor search algorithm using a vantage-point tree to store a database of $n$ profiles and, in certain circumstances, achieves an $\mathcal{O}(\log n)$ expected query time in the limit. We tested GEMINI on breast and ovarian cancer gene expression data from The Cancer Genome Atlas project and show that, in practice, its query time on genomic data scales as the logarithm of the number of records. In a database with $10^5$ samples, GEMINI identifies the nearest neighbor in 0.05 s, compared to a brute-force search time of 0.6 s.

Conclusions: GEMINI is a fast search engine that uses a query genomic profile to search for similar profiles in a very large genomic database. It enables users to identify similar profiles independent of sample label, data origin, or other meta-data information.
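The vantage-point tree search described in the abstract can be illustrated with a minimal Python sketch. This is not GEMINI's implementation: the names `euclidean`, `VPNode`, `build_vp_tree`, and `nearest_neighbor` are hypothetical, the Euclidean metric is an assumption (the abstract does not state which distance is used), and the vantage point is chosen at random for simplicity. Each node splits the remaining profiles at the median distance from its vantage point, and the query uses the triangle inequality to skip subtrees that cannot contain a closer profile.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length profiles (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class VPNode:
    """One tree node: a vantage point, a split radius, and two subtrees."""

    def __init__(self, point, radius, inside, outside):
        self.point = point      # the vantage profile stored at this node
        self.radius = radius    # median distance from the vantage point
        self.inside = inside    # profiles with distance < radius
        self.outside = outside  # profiles with distance >= radius


def build_vp_tree(profiles, rng):
    """Recursively build a vantage-point tree; returns None for no profiles."""
    profiles = list(profiles)
    if not profiles:
        return None
    # Pick a random vantage point and remove it from the working set.
    vantage = profiles.pop(rng.randrange(len(profiles)))
    if not profiles:
        return VPNode(vantage, 0.0, None, None)
    dists = [euclidean(vantage, p) for p in profiles]
    radius = sorted(dists)[len(dists) // 2]  # (upper) median distance
    inside = [p for p, d in zip(profiles, dists) if d < radius]
    outside = [p for p, d in zip(profiles, dists) if d >= radius]
    return VPNode(vantage, radius,
                  build_vp_tree(inside, rng), build_vp_tree(outside, rng))


def nearest_neighbor(node, query, best=None):
    """Return (distance, profile) of the stored profile closest to query."""
    if node is None:
        return best
    d = euclidean(query, node.point)
    if best is None or d < best[0]:
        best = (d, node.point)
    if d < node.radius:
        # Query lies inside the ball: search inside first.
        best = nearest_neighbor(node.inside, query, best)
        # Outside profiles are at least (radius - d) away from the query.
        if d + best[0] >= node.radius:
            best = nearest_neighbor(node.outside, query, best)
    else:
        # Query lies outside the ball: search outside first.
        best = nearest_neighbor(node.outside, query, best)
        # Inside profiles are more than (d - radius) away from the query.
        if d - best[0] <= node.radius:
            best = nearest_neighbor(node.inside, query, best)
    return best


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-in for expression profiles: 1000 samples, 20 features.
    database = [tuple(rng.random() for _ in range(20)) for _ in range(1000)]
    tree = build_vp_tree(database, rng)
    query = tuple(rng.random() for _ in range(20))
    distance, profile = nearest_neighbor(tree, query)
    print("nearest-neighbor distance:", distance)
```

When the pruning tests succeed often, a query visits roughly one branch per level of a balanced tree, which is the intuition behind the logarithmic expected query time. When they rarely succeed, as can happen in high dimensions, the search degrades toward a full scan.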
