A Turing test for artificial expression data

MOTIVATION: The lack of reliable, comprehensive gold standards complicates the development of many bioinformatics tools, particularly for the analysis of expression data and biological networks. Simulation approaches can provide provisional gold standards, such as regulatory networks, for the assessment of network inference methods. However, this just defers the problem, as it is difficult to assess how closely simulators emulate the properties of real data.

RESULTS: In analogy to Turing's test discriminating humans and computers based on responses to questions, we systematically compare real and artificial systems based on their gene expression output. Different expression data analysis techniques such as clustering are applied to both types of datasets. We define and extract distributions of properties from the results, for instance, distributions of cluster quality measures or transcription factor activity patterns. Distributions of properties are represented as histograms to enable the comparison of artificial and real datasets. We examine three frequently used simulators that generate expression data from parameterized regulatory networks. We identify features distinguishing real from artificial datasets that suggest how simulators could be adapted to better emulate real datasets and, thus, become more suitable for the evaluation of data analysis tools.

AVAILABILITY: See http://www2.bio.ifi.lmu.de/~kueffner/attfad/ and the supplement for precomputed analyses; other compendia can be analyzed via the CRAN package attfad. The full datasets can be obtained from http://www2.bio.ifi.lmu.de/~kueffner/attfad/data.tar.gz.
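The histogram comparison described under RESULTS can be illustrated with a small, self-contained sketch. It is not the attfad implementation: the per-gene standard deviation stands in for the properties named in the abstract (cluster quality measures, transcription factor activity patterns), the toy datasets are Gaussian noise, and the histogram distance (half the L1 difference) is an assumption chosen for simplicity. All function names are hypothetical.

```python
import random
import statistics


def gene_property(dataset):
    """Stand-in property: per-gene standard deviation across samples."""
    return [statistics.pstdev(row) for row in dataset]


def histogram(values, lo, hi, bins):
    """Normalized equal-width histogram of values over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins if hi > lo else 1.0
    for v in values:
        idx = int((v - lo) / width)
        idx = max(0, min(idx, bins - 1))  # clamp edge values into range
        counts[idx] += 1
    return [c / len(values) for c in counts]


def histogram_distance(h1, h2):
    """Half the L1 difference: 0 for identical, 1 for disjoint histograms."""
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))


def toy_dataset(rng, genes, samples, noise_sd):
    """Hypothetical expression matrix (genes x samples) of Gaussian noise."""
    return [[rng.gauss(0.0, noise_sd) for _ in range(samples)]
            for _ in range(genes)]


rng = random.Random(0)
real_like = toy_dataset(rng, 200, 20, 1.0)     # plays the role of real data
real_repeat = toy_dataset(rng, 200, 20, 1.0)   # second draw, same settings
artificial = toy_dataset(rng, 200, 20, 0.3)    # simulator with too little noise

props = [gene_property(d) for d in (real_like, real_repeat, artificial)]
pooled = [v for p in props for v in p]
lo, hi = min(pooled), max(pooled)              # shared bins for all datasets

h_real, h_repeat, h_art = (histogram(p, lo, hi, 10) for p in props)
d_same = histogram_distance(h_real, h_repeat)
d_diff = histogram_distance(h_real, h_art)

print(f"real vs. real repeat: {d_same:.3f}")
print(f"real vs. artificial:  {d_diff:.3f}")
```

Binning all datasets over a shared range keeps the histograms directly comparable; a larger distance for the artificial dataset than for the repeated real-like draw marks a property on which the simulator is distinguishable from the reference data.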
