Supporting accessibility and reproducibility in language research in the Alveo virtual laboratory

Abstract

Reproducibility is an important part of scientific research, and studies published in speech and language research usually make some attempt to ensure that the reported work could be reproduced by other researchers. This paper looks at current practice in the field relating to the citation and availability of both data and software methods. Widely available shared datasets are commonly used in this field, which helps to ensure that studies can be reproduced; however, a brief survey of recent papers shows a wide range of data citation styles, only some of which clearly identify the exact data used in the study. Similarly, practices in describing and sharing software artefacts vary considerably, from detailed descriptions of algorithms to linked repositories. The Alveo Virtual Laboratory is a web-based platform to support research based on collections of text, speech and video. Alveo provides a central repository for language data and a set of services for discovery and analysis of data. We argue that some of the features of the Alveo platform may make it easier for researchers to share their data more precisely and to cite the exact software tools used to develop published results. Alveo makes use of ideas developed in other areas of science; we discuss these and how they can be applied to speech and language research.
