Automated annotation of functional imaging experiments via multi-label classification

Identifying the experimental methods used in human neuroimaging papers is important for grouping meaningfully similar experiments in meta-analyses. Currently, this can only be done by human readers. We present the performance of common machine learning (text mining) methods applied to the problem of automatically classifying, or labeling, this literature. The labels are terms from the Cognitive Paradigm Ontology (CogPO), the text corpora are abstracts of published functional neuroimaging papers, and the classifiers are trained on annotations produced by a human expert. We aim to replicate the expert's annotation of multiple labels per abstract, identifying the experimental stimuli, cognitive paradigms, response types, and other relevant dimensions of the experiments. We use several standard machine learning methods: naive Bayes (NB), k-nearest neighbor, and support vector machines (specifically SMO, or sequential minimal optimization). Exact match performance ranged from only 15% in the worst cases to 78% in the best cases. NB methods combined with binary relevance transformations performed strongly and were robust to overfitting. This collection of results demonstrates what can be achieved with off-the-shelf software components and little to no pre-processing of raw text.
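
To make the binary relevance transformation and the exact match measure concrete, the following is a minimal pure-Python sketch, not the software used in the study. It trains one independent yes/no naive Bayes model per label and counts a prediction as correct only when the full predicted label set equals the expert's. The toy abstracts, the label names (`visual_stimulus`, `auditory_stimulus`, `button_press`), and all class and function names are hypothetical and chosen only for illustration; they are not actual CogPO terms or corpus text.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace tokenization only, mirroring "little to no pre-processing".
    return text.lower().split()


class BinaryNB:
    """Multinomial naive Bayes for a single yes/no label (add-one smoothing)."""

    def fit(self, docs, targets):
        self.vocab = {w for d in docs for w in d}
        self.counts = {True: Counter(), False: Counter()}
        self.n_docs = {True: 0, False: 0}
        for doc, target in zip(docs, targets):
            self.counts[target].update(doc)
            self.n_docs[target] += 1
        self.totals = {c: sum(self.counts[c].values()) for c in (True, False)}
        return self

    def log_score(self, doc, c):
        # Smoothed log prior plus smoothed log likelihood of each known word.
        n_all = self.n_docs[True] + self.n_docs[False]
        score = math.log((self.n_docs[c] + 1) / (n_all + 2))
        v = len(self.vocab)
        for w in doc:
            if w in self.vocab:  # words never seen in training are skipped
                score += math.log((self.counts[c][w] + 1) / (self.totals[c] + v))
        return score

    def predict(self, doc):
        return self.log_score(doc, True) > self.log_score(doc, False)


class BinaryRelevance:
    """Binary relevance: one independent binary classifier per label."""

    def fit(self, texts, label_sets):
        docs = [tokenize(t) for t in texts]
        self.labels = sorted(set().union(*label_sets))
        self.models = {
            label: BinaryNB().fit(docs, [label in s for s in label_sets])
            for label in self.labels
        }
        return self

    def predict(self, text):
        doc = tokenize(text)
        return {label for label in self.labels if self.models[label].predict(doc)}


def exact_match(predicted, gold):
    # Fraction of documents whose predicted label set equals the gold set.
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Hypothetical toy data standing in for abstracts and expert annotations.
train_texts = [
    "subjects viewed faces and pressed a button",
    "subjects heard tones and pressed a button",
    "subjects viewed faces passively",
    "subjects heard tones passively",
]
train_labels = [
    {"visual_stimulus", "button_press"},
    {"auditory_stimulus", "button_press"},
    {"visual_stimulus"},
    {"auditory_stimulus"},
]

model = BinaryRelevance().fit(train_texts, train_labels)
predictions = [model.predict(t) for t in train_texts]
print(predictions)
print(exact_match(predictions, train_labels))
```

Exact match is a strict measure: an abstract with one wrong or missing label out of several counts as a complete miss, which helps explain why scores under this criterion can be low even when most individual labels are predicted correctly. Scoring on the training texts, as done here for brevity, says nothing about generalization; a held-out split or cross-validation would be needed for that.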
