The boost in speech technologies that we have witnessed over the last decade has allowed us to go from a state of the art in which correctly recognizing strings of words was a major target to one in which we aim well beyond words. We aim at extracting meaning, but also all the other cues conveyed by the speech signal. In fact, we can estimate bio-relevant traits such as height, weight, gender, age, and physical and mental health. We can also estimate language, accent, emotional and personality traits, and even environmental cues. This wealth of information, which one can now extract thanks to recent advances in machine learning, has motivated an exponentially growing number of speech-based applications that go far beyond the transcription of what a speaker says. In particular, it has motivated many health-related applications, namely aiming at the non-invasive diagnosis and monitoring of diseases that affect speech. Most of the recent work on speech-based diagnosis tools addresses the extraction of features and/or the development of sophisticated machine learning classifiers [5,7,12–14,17]. The results have shown remarkable progress, boosted by several joint paralinguistic challenges, but most of them are obtained from limited training data acquired in controlled conditions.

This talk covers two emerging concerns related to this growing trend. One is the collection of large in-the-wild datasets and the effects of such extended, uncontrolled collection on the results [4]. The other is how diagnosis may be done without compromising patient privacy [18]. As a proof of concept, we will discuss these two aspects and show our results for two target diseases, Depression and Cold, a selection motivated by the availability of corresponding lab datasets distributed in paralinguistic challenges. The availability of these lab datasets allowed us to build a baseline system for each disease, using a simple neural network trained with common features that have not been optimized for either disease. Given the modular architecture adopted, each component of the system can be individually improved at a later stage, although the limited amount of data does not motivate us to exploit deeper networks.

Our mining effort has focused on video blogs (vlogs) that feature a single speaker who, at some point, admits to being currently affected by a given disease. Retrieving vlogs for the target disease involves not only a simple query (e.g. “depression vlog”), but also a postfiltering stage to exclude videos that do not correspond to our target of first-person, present experiences (lectures, in particular, are relatively frequent). This filtering stage combines multimodal features automatically extracted from the video and its metadata, using mostly off-the-shelf tools. We collected a large dataset for each target disease from YouTube, and manually labelled a small subset, which we named the in-the-Wild Speech Medical (WSM) corpus. Although our mining efforts made use of relatively simple techniques and mostly existing toolkits, they proved effective.
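For illustration only, the sketch below conveys the flavour of such a metadata-based postfilter: a tiny text classifier that tries to separate first-person vlogs from other videos (lectures, reviews) using their titles. This is a minimal, hedged sketch in scikit-learn; the example titles, labels, and the text-only setup are hypothetical and much simpler than the multimodal pipeline described above.

```python
# Hypothetical sketch of a metadata-based postfilter: decide whether a retrieved
# video looks like a first-person, present-tense account of the target disease.
# The actual system combines multimodal video and metadata features; this keeps
# only a toy text-based component for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set (1 = first-person vlog, 0 = other).
titles = [
    "my depression story - how I'm coping right now",
    "living with a cold: vlog day 3",
    "lecture: the neuroscience of depression",
    "top 10 cold remedies reviewed",
]
labels = [1, 1, 0, 0]

postfilter = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
postfilter.fit(titles, labels)

# Keep only candidate videos that the postfilter accepts.
candidates = [
    "I think I have a cold today, quick vlog",
    "depression explained (university lecture)",
]
kept = [t for t in candidates if postfilter.predict([t])[0] == 1]
print(kept)
```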
The best-performing models achieved a precision of 88% and 93%, and a recall of 97% and 72%, for the Cold and Depression datasets, respectively, in the task of filtering videos containing these speech-affecting diseases. We then compared the performance of our baseline neural network classifiers, trained with data collected in controlled conditions, in tests with the corresponding in-the-wild data. For the Cold datasets, the baseline neural network achieved an Unweighted Average Recall (UAR) of 66.9% on the controlled dataset, and 53.1% on the manually labelled subset of the WSM corpus. For the Depression datasets, the corresponding values were 60.6% and 54.8%, respectively (at interview level, the UAR increased to 61.9% for the vlog corpus). The performance degradation that we had anticipated when using in-the-wild data may be due to a greater variability both in recording conditions (e.g. microphone, noise) and in the effects of speech-altering diseases on the subjects’ speech. Our current work with vlog datasets attempts to estimate the quality of the predicted labels of a very large set in an unsupervised way, using noisy models.

The second aspect we addressed was patient privacy. Privacy is an emerging concern among users of voice-activated digital assistants, sparked by the awareness of devices that must always be in listening mode. Despite this growing concern, the potential for misuse of health-related speech-based cues has not yet been fully realized. This is the motivation for adopting secure computation frameworks, in which cryptographic techniques are combined with state-of-the-art machine learning algorithms. Privacy in speech processing is an interdisciplinary topic, which was first applied to speaker verification, using Secure Multi-Party Computation and Secure Modular Hashing techniques [1,15], and later to speech emotion recognition, also using hashing techniques [6]. The most recent efforts on privacy-preserving speech processing have followed the progress in secure machine learning, combining neural networks and Fully Homomorphic Encryption (FHE) [3,8,9].

In this work, we applied an encrypted neural network, following the FHE paradigm, to the problem of securely detecting pathological speech. This was done by developing an encrypted version of a neural network, trained with unencrypted data, in order to produce encrypted predictions of health-related labels. As a proof of concept, we used the same two target diseases mentioned above, and compared the performance of the simple neural network classifiers with their encrypted counterparts on datasets collected in controlled conditions. For the Cold dataset, the baseline neural network achieved a UAR of 66.9%, whereas the encrypted network achieved 66.7%. For the Depression dataset, the baseline value was 60.6%, whereas the encrypted network achieved 60.2% (67.9% at interview level). The slight difference in results showed the validity of our secure approach. This approach relies on the computation of features on the client side before encryption, with only the inference stage being computed in an encrypted setting. Ideally, an end-to-end approach would overcome this limitation, but combining convolutional neural networks with FHE imposes severe limitations on their size. Likewise, the use of recurrent layers such as LSTMs (Long Short-Term Memory) requires a number of operations too large for current FHE frameworks, making them computationally infeasible as well.
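Since UAR is the central figure of merit in these comparisons, a minimal sketch may help fix its definition: UAR is the unweighted mean of the per-class recalls (i.e. macro-averaged recall), which makes it robust to the strong class imbalance typical of pathological-speech datasets. The snippet below is illustrative only; the toy labels are hypothetical.

```python
# Minimal sketch of the Unweighted Average Recall (UAR) metric reported above.
import numpy as np
from sklearn.metrics import recall_score

def uar(y_true, y_pred):
    # UAR = mean of per-class recalls = macro-averaged recall.
    return recall_score(y_true, y_pred, average="macro")

# Toy example: 1 = affected (e.g. Cold), 0 = control.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 0])
print(f"UAR = {uar(y_true, y_pred):.3f}")  # average of recall(class 1) and recall(class 0)
```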
FHE schemes, by construction, only work with integers, whilst neural networks work with real numbers. By using encoding methods to convert real weights into integers, we give up the possibility of using an FHE batching technique that would allow us to compute several predictions at the same time using the same encrypted value. Recent advances in machine learning have pushed towards the “quantization” and “discretization” of neural networks, so that models occupy less space and operations consume less power. Some works have already implemented these techniques using homomorphic encryption, such as Binarized Neural Networks [10,11,16] and Discretized Neural Networks [2]. The talk will also cover our recent efforts in applying this type of approach to the detection of health-related cues in speech signals, while discretizing the network and maximizing the throughput of its encrypted counterpart.

More than presenting our recent work on these two aspects of speech analysis for medical applications, this talk intends to point out directions for future work on two relatively unexplored topics that are by no means exhausted in this summary.
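As a closing illustration of the integer-encoding issue raised above, the sketch below shows one common way of mapping real-valued weights and activations to fixed-point integers before encryption, and how the scale factor compounds under multiplication. This is a generic, hedged example in plain NumPy; the scale factor and names are illustrative assumptions, not details of the system described above.

```python
# Hedged sketch of fixed-point encoding for FHE: the scheme operates on integers,
# so real-valued weights and activations are scaled and rounded before encryption.
import numpy as np

FRAC_BITS = 8            # fractional precision; an illustrative design choice
SCALE = 1 << FRAC_BITS

def encode(x):
    """Real -> integer (fixed point)."""
    return np.round(np.asarray(x) * SCALE).astype(np.int64)

def decode(x_int, levels=1):
    """Integer -> real; each multiplication accumulates one extra factor of SCALE."""
    return x_int / float(SCALE ** levels)

w = np.array([0.37, -1.25, 0.08])   # toy "weights"
a = np.array([0.50,  0.10, 2.00])   # toy "activations"

# Integer dot product, as it would be evaluated homomorphically.
z_int = np.dot(encode(w), encode(a))
print(decode(z_int, levels=2), np.dot(w, a))  # close, up to rounding error
```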
[1] Kai Yu et al.: Multi-task joint-learning of deep neural networks for robust speech recognition. 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).
[2] Souvik Kundu et al.: Speaker-aware training of LSTM-RNNs for acoustic modelling. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[3] Christopher D. Manning et al.: Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks. ACL, 2015.
[4] Jasha Droppo et al.: Improving speech recognition in reverberation using a room-aware deep neural network and multi-task learning. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[5] Rico Sennrich et al.: Neural Machine Translation of Rare Words with Subword Units. ACL, 2015.
[6] Susan T. Dumais et al.: A Bayesian Approach to Filtering Junk E-Mail. AAAI, 1998.
[7] Thierry Dutoit et al.: Noise and Speech Estimation as Auxiliary Tasks for Robust Speech Recognition. SLSP, 2017.
[8] Sandeep Subramanian et al.: Adversarial Generation of Natural Language. Rep4NLP@ACL, 2017.
[9] C. Villani: Optimal Transport: Old and New. 2008.
[10] Zhizheng Wu et al.: Improving Trajectory Modelling for DNN-Based Speech Synthesis by Using Stacked Bottleneck Features and Minimum Generation Error Training. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016.
[11] Yifan Gong et al.: An Overview of Noise-Robust Automatic Speech Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014.
[12] Saif Mohammad et al.: Sentiment after Translation: A Case-Study on Arabic Social Media Posts. NAACL, 2015.
[13] Jon Barker et al.: An analysis of environment, microphone and data simulation mismatches in robust speech recognition. Computer Speech & Language, 2017.
[14] Thierry Dutoit et al.: Speaker-aware Multi-Task Learning for automatic speech recognition. 2016 23rd International Conference on Pattern Recognition (ICPR).
[15] Masanori Morise et al.: WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications. IEICE Transactions on Information and Systems, 2016.
[16] Yoshua Bengio et al.: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML, 2015.
[17] Askars Salimbajevs: Bidirectional LSTM for Automatic Punctuation Restoration. Baltic HLT, 2016.
[18] Tanel Alumäe et al.: Bidirectional Recurrent Neural Network with Attention Mechanism for Punctuation Restoration. INTERSPEECH, 2016.
[19] Dong Yu et al.: An investigation into using parallel data for far-field speech recognition. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[20] R. Maas et al.: A summary of the REVERB challenge: state-of-the-art and remaining challenges in reverberant speech processing research. EURASIP Journal on Advances in Signal Processing, 2016.
[21] Patrick Kenny et al.: Front-End Factor Analysis for Speaker Verification. IEEE Transactions on Audio, Speech, and Language Processing, 2011.
[22] John H. L. Hansen et al.: Morphological constrained feature enhancement with adaptive cepstral compensation (MCE-ACC) for speech recognition in noise and Lombard effect. IEEE Transactions on Speech and Audio Processing, 1994.
[23] Daniel Povey et al.: The Kaldi Speech Recognition Toolkit. 2011.
[24] Alan W. Black et al.: Unit selection in a concatenative speech synthesis system using a large speech database. 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
[25] Lukasz Kaiser et al.: Attention Is All You Need. NIPS, 2017.
[26] Gerhard Rigoll et al.: Multi-task learning strategies for a recurrent neural net in a hybrid tied-posteriors acoustic model. INTERSPEECH, 2005.
[27] Antoine Perquin: Big deep voice: indexation de données massives de parole grâce à des réseaux de neurones profonds. 2017.
[28] Bhiksha Raj et al.: Environmental Noise Embeddings for Robust Speech Recognition. arXiv, 2016.
[29] Zhizheng Wu et al.: Deep neural network-guided unit selection synthesis. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[30] Tara N. Sainath et al.: Neural Network Adaptive Beamforming for Robust Multichannel Speech Recognition. INTERSPEECH, 2016.
[31] Samy Bengio et al.: Tensor2Tensor for Neural Machine Translation. AMTA, 2018.
[32] Diyi Yang et al.: Hierarchical Attention Networks for Document Classification. NAACL, 2016.
[33] Lior Wolf et al.: Language Generation with Recurrent Generative Adversarial Networks without Pre-training. arXiv, 2017.
[34] Marc Delcroix et al.: Joint acoustic factor learning for robust deep neural network based automatic speech recognition. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[35] Heiga Zen et al.: Statistical Parametric Speech Synthesis. 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[36] John R. Hershey et al.: Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks. INTERSPEECH, 2015.
[37] Thierry Dutoit et al.: Multi-task learning for speech recognition: an overview. ESANN, 2016.
[38] Satoshi Nakamura et al.: Deep bottleneck features and sound-dependent i-vectors for simultaneous recognition of speech and environmental sounds. 2016 IEEE Spoken Language Technology Workshop (SLT).
[39] Christopher D. Manning et al.: Baselines and Bigrams: Simple, Good Sentiment and Topic Classification. ACL, 2012.
[40] Dong Wang et al.: Multi-task recurrent model for speech and speaker recognition. 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA).
[41] Zhizheng Wu et al.: Merlin: An Open Source Neural Network Speech Synthesis System. SSW, 2016.
[42] Rich Caruana et al.: Multitask Learning. Encyclopedia of Machine Learning and Data Mining, 1998.