Sentence Segmentation and Disfluency Detection in Narrative Transcripts from Neuropsychological Tests

Natural Language Processing (NLP) tools that aim to support the diagnosis of language-impairing dementias generally extract several textual metrics from narrative transcripts. However, these transcripts lack sentence boundaries, which prevents the direct application of NLP methods that rely on such marks to work properly, such as taggers and parsers. We present one method to segment transcripts into sentences and another to detect the disfluencies present in them, both serving as a preprocessing step for subsequent NLP tools. Our methods use recurrent convolutional neural networks with prosodic features, morphosyntactic features, and word embeddings. We evaluated both tasks intrinsically, analyzing the most important features, comparing the proposed methods to simpler ones, and identifying the main hits and misses. In addition, we built a final method that combines the tasks and evaluated it extrinsically using 9 syntactic metrics of Coh-Metrix-Dementia. In the intrinsic evaluations, our method achieved (i) state-of-the-art results for the sentence segmentation task on impaired speech, and (ii) results comparable to those of related work on English for the disfluency detection task. In the extrinsic evaluation, only 3 metrics showed a statistically significant difference between manual MCI transcripts and those generated by our method, suggesting that our method is capable of preprocessing transcripts for further analysis by NLP tools.
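To make the segmentation task concrete, the sketch below frames it as per-token labeling, where each token carries a flag saying whether a sentence ends after it, and scores predicted boundaries against gold ones. This is only an illustration under stated assumptions: the function names `segment` and `boundary_prf`, the 0/1 label encoding, and the toy sentence are hypothetical and are not taken from the paper, and the neural model that would produce the predicted labels is not shown.

```python
from typing import List, Tuple


def segment(tokens: List[str], boundary: List[int]) -> List[List[str]]:
    """Split tokens into sentences.

    Assumption (hypothetical encoding): boundary[i] == 1 means a
    sentence ends right after tokens[i]; 0 means it continues.
    """
    if len(tokens) != len(boundary):
        raise ValueError("tokens and boundary must have the same length")
    sentences: List[List[str]] = []
    current: List[str] = []
    for token, is_end in zip(tokens, boundary):
        current.append(token)
        if is_end == 1:
            sentences.append(current)
            current = []
    if current:  # trailing tokens with no closing boundary
        sentences.append(current)
    return sentences


def boundary_prf(gold: List[int], pred: List[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for the boundary class (label 1)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


if __name__ == "__main__":
    # Toy, invented example: an unpunctuated transcript fragment.
    tokens = "the girl went to the ball she lost her shoe".split()
    gold = [0, 0, 0, 0, 0, 1, 0, 0, 0, 1]  # boundaries after "ball", "shoe"
    pred = [0, 0, 1, 0, 0, 0, 0, 0, 0, 1]  # one misplaced boundary
    print(segment(tokens, gold))
    print(boundary_prf(gold, pred))
```

Scoring only the boundary class, rather than overall token accuracy, matters here because most tokens are not sentence-final, so a system that never predicts a boundary would still look accurate.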
