Polyglot and Speech Corpus Tools: A System for Representing, Integrating, and Querying Speech Corpora

Speech datasets exist for many languages, styles, and sources, and they represent significant potential for scientific studies of speech, particularly given the structural similarities among all speech datasets. However, studies using multiple speech corpora remain difficult in practice because of corpus size, complexity, and differing formats. We introduce open-source software for unified corpus analysis: integrating speech corpora and querying across them. Corpora are stored in a custom ‘polyglot persistence’ scheme that combines three sub-databases, each mirroring a different data type: a Neo4j graph database to represent temporal annotation graph structure, and SQL and InfluxDB databases to represent meta- and acoustic data. This scheme abstracts away from the idiosyncratic formats of different speech corpora, while mirroring the structure of each data type improves speed and scalability. A Python API and a GUI both support three operations: enriching the database with positional, hierarchical, temporal, and signal measures (e.g. utterance boundaries, f0) that are useful for linguistic analysis; querying the database using a simple query language; and exporting query results to standard formats for further analysis. We describe the software, summarize two case studies that use it to examine effects on pitch and duration across languages, and outline planned future development.
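The polyglot persistence idea described above can be illustrated with a minimal sketch. This is not the software's actual API: every name here (`Annotation`, `query_phones`, `mean_f0`, the `speaker` table, the toy data) is hypothetical, and in-memory Python structures stand in for the Neo4j and InfluxDB stores, with `sqlite3` standing in for the SQL store. It only shows how one query might draw on three stores, each shaped like its own data type.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class Annotation:
    """One node of the annotation graph (hypothetical stand-in for a graph-database node)."""
    id: int
    kind: str                     # annotation type, e.g. "word" or "phone"
    label: str
    begin: float                  # start time in seconds
    end: float                    # end time in seconds
    speaker: str
    parent: Optional[int] = None  # hierarchical edge: phone -> containing word


# 1) Annotation graph (stand-in for the graph store): one word and its phones.
annotations = [
    Annotation(1, "word", "cot", 0.00, 0.30, "s1"),
    Annotation(2, "phone", "k", 0.00, 0.08, "s1", parent=1),
    Annotation(3, "phone", "aa", 0.08, 0.22, "s1", parent=1),
    Annotation(4, "phone", "t", 0.22, 0.30, "s1", parent=1),
]
by_id = {a.id: a for a in annotations}

# 2) Metadata (stand-in for the SQL store): one row per speaker.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE speaker (name TEXT PRIMARY KEY, gender TEXT)")
conn.execute("INSERT INTO speaker VALUES (?, ?)", ("s1", "f"))

# 3) Acoustic data (stand-in for the time-series store): (time, speaker, f0 in Hz).
f0_track = [(0.10, "s1", 200.0), (0.15, "s1", 210.0),
            (0.20, "s1", 220.0), (0.25, "s1", 190.0)]


def mean_f0(speaker, begin, end):
    """Average the f0 samples of one speaker that fall inside [begin, end]."""
    vals = [f0 for (t, spk, f0) in f0_track
            if spk == speaker and begin <= t <= end]
    return sum(vals) / len(vals) if vals else None


def query_phones(label):
    """Find phones with a given label and enrich each with data from all three stores."""
    rows = []
    for a in annotations:
        if a.kind == "phone" and a.label == label:
            word = by_id[a.parent]  # follow the hierarchical edge
            (gender,) = conn.execute(
                "SELECT gender FROM speaker WHERE name = ?", (a.speaker,)
            ).fetchone()
            rows.append({
                "phone": a.label,
                "word": word.label,
                "duration": round(a.end - a.begin, 3),
                "speaker": a.speaker,
                "gender": gender,
                "mean_f0": mean_f0(a.speaker, a.begin, a.end),
            })
    return rows


results = query_phones("aa")
print(results)
```

Each returned row combines a positional and hierarchical fact from the annotation graph (the phone and its containing word), a metadata fact (speaker gender), and a signal measure (mean f0 over the phone's interval). Rows of this kind are what one would typically export to a tabular format for further analysis.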
