With Little Power Comes Great Responsibility
Peter Henderson | Dallas Card | Dan Jurafsky | Urvashi Khandelwal | Robin Jia | Kyle Mahowald