Content-based search of gene expression databases using binary fingerprints of differential expression profiles

Abstract: The availability and rapid growth of microarray databases have made integrated analysis of these databases computationally challenging. We present a novel approach to content-based searching in microarray databases that uses binary vector representations and is inspired by the field of chemoinformatics. A benchmark compendium of microarray datasets is established for evaluating content-based searching. Differential expression profiles from microarray experiments are represented either as floating point vectors or as concise binary vectors. The benchmark compendium is searched using several distance measures for determining similarity. We demonstrate that binary vector representations achieve accuracies equivalent to or better than those of floating point measures, while significantly reducing the time required to search a microarray database, owing to fast bitwise operations and reduced memory requirements. Experiments on a large database of binary vector representations demonstrate that a modified Tanimoto distance measure is best suited for content-based search of differential microarray profiles. The search method is available as a web service at: http://sacan.biomed.drexel.edu/mageoindex/.
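To illustrate the general idea of fingerprint-based search, the following is a minimal Python sketch, not the authors' implementation. The abstract does not specify how profiles are binarized or how the Tanimoto measure is modified, so everything here is an assumption made for illustration: the thresholding rule (two bits per gene, one for up-regulation and one for down-regulation), the `threshold` value, the function names, and the use of the standard, unmodified Tanimoto distance.

```python
# Illustrative sketch only. The binarization rule, the threshold, and all
# names below are assumptions; the standard Tanimoto distance is used here,
# not the modified measure referred to in the abstract.

def fingerprint(log_fold_changes, threshold=1.0):
    """Pack a differential expression profile into a Python int bitset.

    Gene i sets bit 2*i if up-regulated (value >= threshold) and
    bit 2*i + 1 if down-regulated (value <= -threshold).
    """
    bits = 0
    for i, value in enumerate(log_fold_changes):
        if value >= threshold:
            bits |= 1 << (2 * i)
        elif value <= -threshold:
            bits |= 1 << (2 * i + 1)
    return bits


def tanimoto_distance(a, b):
    """Standard Tanimoto distance: 1 - |a AND b| / |a OR b|."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    shared = bin(a & b).count("1")
    return 1.0 - shared / union


def search(query, database, top_k=3):
    """Return the top_k (distance, name) pairs closest to the query."""
    scored = [(tanimoto_distance(query, fp), name)
              for name, fp in database.items()]
    return sorted(scored)[:top_k]


if __name__ == "__main__":
    # Hypothetical toy profiles (log fold changes for four genes).
    database = {
        "experiment_A": fingerprint([2.0, -1.5, 0.1, 0.0]),
        "experiment_B": fingerprint([2.0, 1.5, 0.1, -3.0]),
        "experiment_C": fingerprint([-2.0, 1.5, 0.0, 0.0]),
    }
    query = fingerprint([1.8, -1.2, 0.0, 0.2])
    for distance, name in search(query, database):
        print(f"{name}: {distance:.3f}")
```

Because each comparison reduces to a bitwise AND, a bitwise OR, and two population counts, this kind of search touches far less memory than a floating point comparison over the same genes, which is consistent with the speed advantage described in the abstract.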
