Virk: an active learning-based system for bootstrapping knowledge base development in the neurosciences

The frequency and volume of newly published scientific literature are quickly making manual maintenance of publicly available databases of primary data unrealistic and costly. Although machine learning (ML) can be useful for developing automated approaches to identifying scientific publications that contain relevant information for a database, developing such tools requires manually annotating an impractically large number of documents. One approach to this problem, active learning (AL), builds classification models by iteratively identifying the documents that provide the most information to a classifier. Although this approach has been shown to be effective for related problems, it falls short in the context of scientific database curation. We present Virk, an AL system that, while being trained, simultaneously learns a classification model and identifies documents containing information of interest for a knowledge base. Our approach uses a support vector machine (SVM) classifier with input features derived from neuroscience-related publications from the primary literature. Using our approach, we were able to increase the size of the Neuron Registry, a knowledge base of neuron-related information, by 90% in 3 months. Using standard biocuration methods, it would have taken between 1 and 2 years to make the same number of contributions to the Neuron Registry. Here, we describe the system pipeline in detail and evaluate its performance against other approaches to sampling in AL.
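
The pool-based AL loop described above can be illustrated with a minimal sketch. This is not the Virk implementation: the Pegasos-style linear SVM trainer, the function names, the toy one-dimensional features, and both query strategies are assumptions made for illustration. `select_uncertain` picks the documents closest to the decision boundary (classic uncertainty sampling), while `select_likely_positive` picks the documents the current model scores as most likely relevant, which is one plausible way to learn a model and surface documents of interest at the same time.

```python
import random

def train_linear_svm(X, y, epochs=50, lam=0.01, seed=0):
    """Pegasos-style SGD on the hinge loss; labels are -1 or +1.
    Returns a weight list whose last entry is the bias term."""
    rng = random.Random(seed)
    w = [0.0] * (len(X[0]) + 1)
    t = 0
    for _ in range(epochs):
        order = list(range(len(X)))
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)              # decaying step size
            xi = list(X[i]) + [1.0]            # append constant bias feature
            margin = y[i] * sum(a * b for a, b in zip(w, xi))
            w = [(1.0 - eta * lam) * a for a in w]   # regularization shrink
            if margin < 1.0:                   # hinge loss is active
                w = [a + eta * y[i] * b for a, b in zip(w, xi)]
    return w

def decision_function(w, x):
    """Signed score of x under the linear model; positive means 'relevant'."""
    return sum(a * b for a, b in zip(w, list(x) + [1.0]))

def select_uncertain(scores, k):
    """Indices of the k scores closest to the decision boundary."""
    return sorted(range(len(scores)), key=lambda i: abs(scores[i]))[:k]

def select_likely_positive(scores, k):
    """Indices of the k highest scores (most confidently relevant)."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

def active_learning_loop(X, oracle, seed_idx, rounds, batch, select):
    """Pool-based AL: train, score the unlabeled pool, query the oracle
    (a stand-in for the human curator) on the selected items, repeat."""
    labeled = list(seed_idx)
    labels = {i: oracle(i) for i in labeled}
    pool = [i for i in range(len(X)) if i not in labels]
    for _ in range(rounds):
        if not pool:
            break
        w = train_linear_svm([X[i] for i in labeled],
                             [labels[i] for i in labeled])
        scores = [decision_function(w, X[i]) for i in pool]
        chosen = [pool[j] for j in select(scores, batch)]
        for i in chosen:
            labels[i] = oracle(i)              # curator supplies the label
        labeled.extend(chosen)
        pool = [i for i in pool if i not in labels]
    w = train_linear_svm([X[i] for i in labeled],
                         [labels[i] for i in labeled])
    return w, labeled

# Toy usage: 10 one-dimensional "documents"; relevant when the feature is >= 0.
X = [[float(v)] for v in range(-5, 5)]
oracle = lambda i: 1 if X[i][0] >= 0 else -1
w, labeled = active_learning_loop(X, oracle, seed_idx=[0, 9], rounds=2,
                                  batch=2, select=select_likely_positive)
```

Swapping `select_likely_positive` for `select_uncertain` changes only which documents are sent to the curator in each round, which makes it straightforward to compare sampling strategies under the same labeling budget.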
