SSizer: Determining the Sample Sufficiency for Comparative Biological Study.

Comparative biological studies typically require a large number of samples to ensure that the problem under study is fully represented. A frequently encountered question is how many samples are sufficient for a particular study. This question is traditionally assessed using statistical power, but power alone may not guarantee the full and reproducible discovery of features that truly discriminate between biological groups. Two additional types of statistical criteria have therefore been introduced to assess sample sufficiency from different perspectives, by considering diagnostic accuracy and robustness. Because these criteria are complementary, a comprehensive evaluation based on all of them is necessary for a more accurate assessment. However, no such tool is available yet. Herein, an online tool, SSizer (https://idrblab.org/ssizer/), was developed and validated to assess the sample sufficiency of a user-input biological dataset, and three statistical criteria (statistical power, diagnostic accuracy, and robustness) were adopted to achieve a comprehensive and collective assessment. A sample simulation based on the user-input dataset was performed to expand the data and then determine the sample size required by a particular study. In sum, SSizer is unique in its ability to comprehensively evaluate whether the sample size is sufficient and to determine the required number of samples for a user-input dataset, thereby facilitating comparative and OMIC-based biological studies.
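
As an illustration of the traditional power-based criterion only, the following minimal sketch estimates the smallest per-group sample size that reaches a target power for a two-group comparison. It is not SSizer's implementation: the function names, the standardized effect size, and the normal approximation to the two-sample test are all assumptions made for this example, and the diagnostic accuracy and robustness criteria are not covered.

```python
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison.

    Assumption: a normal approximation with equal group sizes and a
    standardized effect size (difference in means / common SD).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the alternative.
    shift = abs(effect_size) * (n_per_group / 2) ** 0.5
    # Probability of rejecting in either tail.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


def min_samples_for_power(effect_size, target_power=0.8, alpha=0.05, max_n=100000):
    """Smallest per-group n whose approximate power reaches target_power.

    Returns None if no n up to max_n is sufficient (hypothetical helper).
    """
    for n in range(2, max_n + 1):
        if approx_power(effect_size, n, alpha) >= target_power:
            return n
    return None


if __name__ == "__main__":
    for d in (0.2, 0.5, 0.8):
        print(d, min_samples_for_power(d))
```

A power curve of this kind answers only whether a true difference of a given size is likely to be detected; it says nothing about whether the discovered features classify samples accurately or remain stable across resampled datasets, which is why the additional criteria are considered.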
