Statistical algorithms for ontology-based annotation of scientific literature

Background: Ontologies encode relationships within a domain in robust data structures that can be used to annotate data objects, including scientific papers, in ways that ease tasks such as search and meta-analysis. However, the annotation process requires significant time and effort when performed by humans. Text mining algorithms can facilitate this process, but they rely mainly on keyword, synonym, and semantic matching. They do not leverage information embedded in an ontology's structure.

Methods: We present a probabilistic framework that facilitates the automatic annotation of literature by indirectly modeling the restrictions among the different classes in the ontology. Our research focuses on annotating human functional neuroimaging literature within the Cognitive Paradigm Ontology (CogPO). We use an approach that combines the stochastic simplicity of naïve Bayes with the formal transparency of decision trees. Our data structure is easily modifiable to reflect changing domain knowledge.

Results: We compare naïve Bayes, Bayesian Decision Trees, and Constrained Decision Tree classifiers that keep a human expert in the loop, using the micro-averaged F1 score (F1-micro) as the quality measure.

Conclusions: Unlike traditional text mining algorithms, our framework can model the knowledge encoded by the dependencies in an ontology, albeit indirectly. We successfully exploit the fact that CogPO has explicitly stated restrictions, as well as implicit dependencies in the form of patterns in the expert-curated annotations.
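For readers unfamiliar with the quality measure named in the Results, the following is a minimal sketch of how a micro-averaged F1 score is commonly computed for multi-label annotation. It is not the authors' evaluation code. The function name `micro_f1` and the example labels are hypothetical and chosen only for illustration; they are not taken from the paper or guaranteed to be CogPO terms. Micro-averaging pools true positives, false positives, and false negatives over all documents and labels before computing F1.

```python
from typing import Sequence, Set


def micro_f1(gold: Sequence[Set[str]], predicted: Sequence[Set[str]]) -> float:
    """Micro-averaged F1 for multi-label annotations.

    gold[i] and predicted[i] are the label sets for document i.
    Counts are pooled across all documents before F1 is computed.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)   # labels assigned correctly
        fp += len(p - g)   # labels assigned but not in the gold set
        fn += len(g - p)   # gold labels that were missed

    denom = 2 * tp + fp + fn
    # Convention assumed here: return 0.0 when there is nothing to score.
    return 2 * tp / denom if denom else 0.0


# Hypothetical annotations for two papers (illustrative labels only).
gold = [{"Visual", "Button Press"}, {"Auditory"}]
predicted = [{"Visual"}, {"Auditory", "Button Press"}]
score = micro_f1(gold, predicted)
print(round(score, 4))
```

In this toy example the pooled counts are 2 true positives, 1 false positive, and 1 false negative, so the score is 2·2 / (2·2 + 1 + 1) = 2/3. Because counts are pooled, frequent labels influence the micro-averaged score more than rare ones.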
