Solving the AL Chicken-and-Egg Corpus and Model Problem: Model-free Active Learning for Phenomena-driven Corpus Construction

Active learning (AL) is often used in corpus construction (CC) to select “informative” documents for annotation. This is ideal for focusing annotation effort when not all documents can be annotated, but it has the limitation that it is carried out in a closed loop, selecting points that will improve an existing model. For phenomena-driven and exploratory CC, the lack of an existing model and of a specific task to apply it to makes traditional AL inapplicable. In this paper we propose a novel method for model-free AL that uses characteristics of the phenomena themselves to select documents for annotation. The method can also supplement traditional closed-loop AL-based CC to extend the utility of the resulting corpus beyond a single task. We introduce our tool, MOVE, and show its potential with a real-world case study.
