Thought Recognition: Predicting and Decoding Brain Activity Using the Zero-Shot Learning Model

Machine learning algorithms have been successfully applied to learning classifiers in many domains such as computer vision, fraud detection, and brain image analysis. Typically, classifiers are trained to predict a class value given a set of labeled training data that includes all possible class values, and sometimes additional unlabeled training data. Little research has addressed the setting in which the class variable can take values that were omitted from the training examples. This is an important problem setting, especially in domains where the class variable can take on many values and the cost of obtaining labeled examples for all of them is high. We show that the key to addressing this problem is not to predict the held-out classes directly, but rather to recognize the semantic properties of the classes, such as their physical or functional attributes. We formalize this method as zero-shot learning and show that by utilizing semantic knowledge mined from large text corpora and collected from humans through crowdsourcing, we can discriminate classes without explicitly collecting examples of those classes for a training set. As a case study, we consider this problem in the context of thought recognition, where the goal is to classify the pattern of brain activity observed from a non-invasive neural recording device. Specifically, we train classifiers to predict the concrete noun that a person is thinking about based on an observed image of that person's neural activity. We show that by predicting semantic properties of the nouns, such as "is it heavy?" and "is it edible?", we can discriminate concrete nouns that people are thinking about, even without explicitly collecting examples of those nouns for a training set. Further, this allows discrimination of certain nouns within the same category with significantly higher accuracy than in previous work.
We also show that the zero-shot learning model, in addition to being an important step forward for neural imaging and brain-computer interfaces, has important implications for the broader machine learning community by providing a means for learning algorithms to extrapolate beyond their explicit training set.
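
The two-stage idea described above (first predict semantic properties, then pick the class whose known properties match best) can be illustrated with a small sketch. This is not the authors' implementation: the ridge regressor, the Euclidean nearest-neighbor rule, the three binary attributes, the noun-to-attribute codes, and the synthetic "neural" features are all assumptions chosen only to make the example self-contained.

```python
import numpy as np

def fit_attribute_regressor(X, S, lam=1.0):
    """Closed-form ridge regression from inputs X (n x d) to attributes S (n x p)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ S)

def predict_class(X, W, candidates):
    """Predict attributes for each row of X, then return the nearest candidate class."""
    names = list(candidates)
    codes = np.array([candidates[n] for n in names], dtype=float)  # (k x p)
    S_hat = X @ W                                                   # (n x p)
    dists = np.linalg.norm(S_hat[:, None, :] - codes[None, :, :], axis=2)
    return [names[i] for i in dists.argmin(axis=1)]

# Hypothetical attribute codes: (is it heavy?, is it edible?, is it man-made?)
seen = {"truck": [1, 0, 1], "carrot": [0, 1, 0],
        "boulder": [1, 0, 0], "cake": [0, 1, 1]}
unseen = {"pencil": [0, 0, 1], "feather": [0, 0, 0]}  # never used in training

rng = np.random.default_rng(0)
mixing = rng.normal(size=(3, 20))  # assumed linear map from attributes to features

def simulate(classes, n_per_class):
    """Generate noisy synthetic feature vectors for each class (illustrative only)."""
    labels = [name for name in classes for _ in range(n_per_class)]
    S = np.array([classes[name] for name in labels], dtype=float)
    X = S @ mixing + 0.1 * rng.normal(size=(len(labels), mixing.shape[1]))
    return X, S, labels

X_train, S_train, _ = simulate(seen, 30)
X_test, _, y_test = simulate(unseen, 20)

W = fit_attribute_regressor(X_train, S_train)
predicted = predict_class(X_test, W, unseen)
accuracy = float(np.mean([p == t for p, t in zip(predicted, y_test)]))
print(f"accuracy on unseen classes: {accuracy:.2f}")
```

In this sketch the classifier never sees an example of "pencil" or "feather"; it can still tell them apart because their attribute codes are known in advance and the attribute regressor is learned from other nouns.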
