Recall and bias of retrieving gene expression microarray datasets through PubMed identifiers

Background: The ability to locate publicly available gene expression microarray datasets effectively and efficiently facilitates the reuse of these potentially valuable resources. Centralized biomedical databases allow users to query dataset metadata descriptions, but these annotations are often too sparse and diverse to allow complex and accurate queries. In this study we examined the ability of PubMed article identifiers to locate publicly available gene expression microarray datasets, and investigated whether the retrieved datasets were representative of publicly available datasets found through statements of data sharing in the associated research articles.

Results: In a recent article, Ochsner and colleagues identified 397 studies that had generated gene expression microarray data. Their search of the full text of each publication for statements of data sharing revealed 203 publicly available datasets, including 179 in the Gene Expression Omnibus (GEO) or ArrayExpress databases. Our scripted search of GEO and ArrayExpress for PubMed identifiers of the same 397 studies returned 160 datasets, including six not found by the original search for data sharing statements. As a proportion of datasets found by either method, the search for data sharing statements identified 91.4% of the 209 publicly available datasets, compared to 76.6% found by our search for PubMed identifiers. Searching GEO or ArrayExpress alone retrieved 63.2% and 46.9% of all available datasets, respectively. Studies retrieved through PubMed identifiers were representative of all datasets in terms of research theme, technology, size, and impact, though the recall was highest for datasets published by the highest-impact journals.

Conclusions: Searching database entries using PubMed identifiers can identify the majority of publicly available datasets. We urge authors of all datasets to complete the citation fields for their dataset submissions once publication details are known, thereby ensuring their work has maximum visibility and can contribute to subsequent studies.
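
The recall figures above are proportions of the union of datasets found by either method. A minimal sketch of that calculation follows; it is an illustration only, not the authors' script. The function name `recall_against_union` and the toy identifiers are hypothetical, and only the counts 160 and 209 are taken from the abstract.

```python
def recall_against_union(found_by_method, *all_methods):
    """Fraction of the datasets found by any method that this one method found.

    found_by_method: iterable of dataset identifiers retrieved by one method.
    all_methods: one iterable of identifiers per method being compared.
    """
    union = set().union(*all_methods)
    if not union:
        raise ValueError("no datasets were found by any method")
    return len(set(found_by_method) & union) / len(union)


# Toy data with hypothetical identifiers (not real PubMed IDs).
via_statements = {"study-1", "study-2", "study-3"}
via_pmid_search = {"study-2", "study-3", "study-4"}

recall_statements = recall_against_union(
    via_statements, via_statements, via_pmid_search
)
recall_pmid = recall_against_union(
    via_pmid_search, via_statements, via_pmid_search
)

# Check against the abstract: 160 datasets retrieved through PubMed
# identifiers out of 209 found by either method.
pmid_recall_percent = round(160 / 209 * 100, 1)
```

In the toy data each method finds three of the four datasets in the union, so both recalls are 0.75. The last line reproduces the 76.6% reported for the PubMed identifier search.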
