Fast searches of large collections of single cell data using scfind

Single-cell technologies have made it possible to profile millions of cells, but for these resources to be useful they must be easy to query and access. To facilitate interactive and intuitive access to single-cell data, we have developed scfind, a search engine for cell atlases. Using transcriptome data from mouse cell atlases, we show how scfind can be used to evaluate marker genes, to perform in silico gating, and to identify both cell-type-specific and housekeeping genes. Moreover, we have developed a subquery optimization routine to ensure that long and complex queries return meaningful results. To make scfind more user-friendly and accessible, we use indices of PubMed abstracts and techniques from natural language processing to allow for arbitrary queries. Finally, we show how scfind can be used for multi-omics analyses by combining single-cell ATAC-seq data with transcriptome data.
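The idea of searching a cell atlas by gene, and of in silico gating on several genes at once, can be illustrated with a toy inverted index. The sketch below is not scfind's implementation or API: the function names (`build_index`, `find_cells`, `count_by_cell_type`), the detection threshold (count > 0), and the three-cell example data are all assumptions made for illustration only.

```python
from collections import defaultdict


def build_index(expression):
    """Map each gene to the set of cell ids in which it is detected (count > 0)."""
    index = defaultdict(set)
    for cell_id, counts in expression.items():
        for gene, count in counts.items():
            if count > 0:
                index[gene].add(cell_id)
    return index


def find_cells(index, genes):
    """Return the cells that express every gene in the query (logical AND)."""
    hits = [index.get(gene, set()) for gene in genes]
    return set.intersection(*hits) if hits else set()


def count_by_cell_type(cells, cell_types):
    """Summarise a set of matching cells as counts per annotated cell type."""
    counts = defaultdict(int)
    for cell_id in cells:
        counts[cell_types[cell_id]] += 1
    return dict(counts)


# Hypothetical toy data: per-cell gene counts and cell-type labels.
expression = {
    "cell1": {"Cd3e": 5, "Cd4": 2, "Cd8a": 0},
    "cell2": {"Cd3e": 3, "Cd4": 0, "Cd8a": 4},
    "cell3": {"Cd3e": 0, "Cd4": 0, "Cd8a": 0, "Alb": 9},
}
cell_types = {"cell1": "T cell", "cell2": "T cell", "cell3": "hepatocyte"}

index = build_index(expression)
hits = find_cells(index, ["Cd3e", "Cd4"])      # gate on Cd3e AND Cd4
summary = count_by_cell_type(hits, cell_types)
print(sorted(hits), summary)
```

Because the index is built once and each query reduces to set intersections, a multi-gene query does not need to rescan the expression matrix. A production-scale search engine would additionally need a compact index representation and a way to handle long queries for which no single cell matches every gene, which is the problem the subquery optimization routine mentioned above addresses.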
