Automatic classification of written descriptions by healthy adults: An overview of the application of natural language processing and machine learning techniques to clinical discourse analysis

Discourse production is an important aspect in the evaluation of brain-injured individuals. We believe that studies comparing the performance of brain-injured subjects with that of healthy controls must use groups with comparable levels of education. A pioneering application of machine learning methods to Brazilian Portuguese for clinical purposes is described, highlighting education as an important variable in the Brazilian scenario.

Objective: The aims were to describe how to: (i) develop machine learning classifiers that use features generated by natural language processing tools to assign descriptions produced by healthy individuals to classes based on their years of education; and (ii) automatically identify the features that best distinguish the groups.

Methods: The approach proposed here extracts linguistic features automatically from the written descriptions with the aid of two Natural Language Processing tools: Coh-Metrix-Port and AIC. It also includes nine task-specific features: three new ones, two extracted manually, and four others, namely description time; type of scene described (simple or complex); presentation order (which type of picture was described first); and age. This study used the descriptions produced by 144 of the 200 healthy Brazilians of both genders studied in Toledo (2011).

Results and Conclusion: A Support Vector Machine (SVM) with a radial basis function (RBF) kernel was the most suitable approach for the binary classification of our data, classifying three of the four initial classes. CfsSubsetEval (CFS) is a strong candidate to replace manual feature selection methods.
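To make the feature selection step more concrete, the sketch below illustrates the idea behind correlation-based feature selection (CFS): a subset of features scores well when its features correlate strongly with the class but weakly with each other. This is a minimal illustration and not the study's pipeline. It assumes Pearson correlation and a simple greedy forward search, whereas WEKA's CfsSubsetEval may use a different correlation measure and search strategy. The function names, the `min_gain` parameter, and the toy data are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def cfs_merit(columns, labels, subset):
    """CFS merit: k * r_cf / sqrt(k + k * (k - 1) * r_ff).

    r_cf is the mean absolute feature-class correlation and r_ff the mean
    absolute feature-feature correlation over the chosen subset.
    """
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(abs(pearson(columns[i], labels)) for i in subset) / k
    if k == 1:
        return r_cf
    pairs = [(i, j) for pos, i in enumerate(subset) for j in subset[pos + 1:]]
    r_ff = sum(abs(pearson(columns[i], columns[j])) for i, j in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def greedy_cfs(columns, labels, min_gain=1e-9):
    """Greedy forward search: add the best feature until merit stops improving."""
    selected, best = [], 0.0
    remaining = list(range(len(columns)))
    while remaining:
        merit, idx = max((cfs_merit(columns, labels, selected + [i]), i) for i in remaining)
        if merit <= best + min_gain:
            break
        selected.append(idx)
        remaining.remove(idx)
        best = merit
    return selected, best


# Toy data (invented): two education classes and three candidate features.
labels = [0, 0, 0, 1, 1, 1]
columns = [
    [1, 2, 1, 5, 6, 5],     # informative feature
    [2, 4, 2, 10, 12, 10],  # redundant copy of the first feature (scaled by 2)
    [3, 1, 2, 3, 1, 2],     # feature unrelated to the class
]
selected, merit = greedy_cfs(columns, labels)
print("selected feature indices:", selected, "merit:", round(merit, 4))
```

On this toy data the search keeps only one of the two redundant features and skips the unrelated one, which is the behaviour that makes CFS attractive as a replacement for manual selection.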

[1]  Bernadette Ska,et al.  Production of narratives: Picture sequence facilitates organizational but not conceptual processing in less educated subjects , 2001, Brain and Cognition.

[2]  Lisa M. Bonnici,et al.  A Rubric for Extracting Idea Density from Oral Language Samples , 2012, Current protocols in neuroscience.

[3]  Heather Harris Wright,et al.  Evaluating measures of global coherence ability in stories in adults. , 2013, International journal of language & communication disorders.

[4]  Eric Laporte,et al.  UNITEX-PB, a set of flexible language resources for Brazilian Portuguese , 2005 .

[5]  Sandra M. Aluísio,et al.  Análise da Inteligibilidade de textos via ferramentas de Processamento de Língua Natural: adaptando as métricas do Coh-Metrix para o Português , 2010, Linguamática.

[6]  Ricardo Nitrini,et al.  Illiteracy: the neuropsychology of cognition without reading. , 2010, Archives of Clinical Neuropsychology.

[7]  Lucia Specia,et al.  Readability Assessment for Text Simplification , 2010 .

[8]  A. Cantagallo,et al.  Narrative discourse in anomic aphasia , 2012, Neuropsychologia.

[9]  P V Cooper,et al.  Discourse production and normal aging: performance on oral picture description tasks. , 1990, Journal of gerontology.

[10]  Michael Cannizzaro,et al.  Analysis of Narrative Discourse Structure as an Ecologically Relevant Measure of Executive Function in Adults , 2012, Journal of Psycholinguistic Research.

[11]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SIGKDD Explorations.

[12]  G. Le Dorze,et al.  Effects of age and education on the lexico-semantic content of connected speech in adults. , 1998, Journal of communication disorders.

[13]  Débora Cristina Alves,et al.  PERFORMANCE DE MORADORES DA GRANDE SÃO PAULO NA DESCRIÇÃO DA PRANCHA DO ROUBO DE BISCOITOS , 2005 .

[14]  L. Togher,et al.  Discourse sampling in the 21st century. , 2001, Journal of communication disorders.

[15]  M. Grossman,et al.  Trying to tell a tale , 2006, Neurology.

[16]  E. Maziero,et al.  Automatic Identification of Multi-document Relations , 2012 .

[17]  K. Forbes-McKay,et al.  Detecting subtle spontaneous language decline in early Alzheimer’s disease with a picture description task , 2005, Neurological Sciences.

[18]  Brian Roark,et al.  Spoken Language Derived Measures for Detecting Mild Cognitive Impairment , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[19]  Arthur C. Graesser,et al.  Coh-Metrix: Analysis of text on cohesion and language , 2004, Behavior research methods, instruments, & computers : a journal of the Psychonomic Society, Inc.

[20]  Sergio Carlomagno,et al.  Age-related Differences in the Production of Textual Descriptions , 2005, Journal of psycholinguistic research.

[21]  S. Hendricks,et al.  Incorporating computer-aided language sample analysis into clinical practice. , 2010, Language, speech, and hearing services in schools.

[22]  E. Armstrong,et al.  Aphasic discourse analysis: The story so far , 2000 .

[23]  Margaret Forbes,et al.  AphasiaBank: Methods for studying discourse , 2011, Aphasiology.

[24]  C. Mackenzie,et al.  Adult spoken discourse: the influences of age and education. , 2000, International journal of language & communication disorders.

[25]  Heather Harris Wright,et al.  Attention and Off-Topic Speech in the Recounts of Middle-Age and Elderly Adults: A Pilot Investigation. , 2012, Contemporary issues in communication science and disorders : CICSD.

[26]  Maria Tereza Camargo Biderman,et al.  Dicionário ilustrado de português , 2008 .

[27]  J. Neils,et al.  Effects of age, education, and living environment on Boston Naming Test performance. , 1995, Journal of speech and hearing research.

[28]  Kathleen C. Fraser,et al.  Automated classification of primary progressive aphasia subtypes from narrative speech transcripts , 2014, Cortex.

[29]  Maria Alice Pimenta,et al.  Ativação de modelos mentais no recontar de histórias por idosos , 1999 .

[30]  Cíntia Matsuda Toledo,et al.  Variáveis sociodemográficas na produção do discurso em adultos sadios , 2011 .