Bootstrapping Named Entity Annotation by Means of Active Machine Learning: A Method for Creating Corpora

This thesis describes the development and in-depth empirical investigation of a method, called BootMark, for bootstrapping the marking up of named entities in textual documents. The reason for working with documents, as opposed to, for instance, sentences or phrases, is that the BootMark method is concerned with the creation of corpora. The claim made in the thesis is that BootMark requires a human annotator to manually annotate fewer documents in order to produce a named entity recognizer with a given performance than would be needed if the documents forming the basis for the recognizer were randomly drawn from the same corpus. The intention is then to use the created named entity recognizer as a pre-tagger and thus eventually turn the manual annotation process into one in which the annotator reviews system-suggested annotations rather than creating new ones from scratch. The BootMark method consists of three phases: (1) manual annotation of a set of documents; (2) bootstrapping, in which active machine learning is used to select which document to annotate next; and (3) marking up the remaining unannotated documents of the original corpus using pre-tagging with revision. Five emerging issues are identified, described and empirically investigated in the thesis. Their common denominator is that they all depend on the realization of the named entity recognition task and, as such, require the context of a practical setting in order to be properly addressed.
The emerging issues are related to: (1) the characteristics of the named entity recognition task and the base learners used in conjunction with it; (2) the constitution of the set of documents annotated by the human annotator in phase one in order to start the bootstrapping process; (3) the active selection of the documents to annotate in phase two; (4) the monitoring and termination of the active learning carried out in phase two, including a new intrinsic stopping criterion for committee-based active learning; and (5) the applicability of the named entity recognizer created during phase two as a pre-tagger in phase three. The outcomes of the empirical investigations concerning the emerging issues support the claim made in the thesis. The results also suggest that while the recognizer produced in phases one and two is as useful for pre-tagging as a recognizer created from randomly selected documents, the applicability of the recognizer as a pre-tagger is best investigated by conducting a user study involving real annotators working on a real named entity recognition task.
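
As an illustration of the document selection step in phase two, the following minimal sketch ranks unannotated documents by how much a committee of recognizers disagrees on them. It is not taken from the thesis: the disagreement measure (vote entropy averaged over tokens), the function names, and the toy data are all assumptions chosen for illustration, and the abstract does not state which measure BootMark actually uses.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy (in bits) of the committee's votes for one token.

    0.0 means every committee member assigned the same label.
    """
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def document_disagreement(committee_labels):
    """Average per-token vote entropy for one document.

    committee_labels holds one label sequence per committee member;
    all sequences have one label per token and the same length.
    """
    n_tokens = len(committee_labels[0])
    if n_tokens == 0:
        return 0.0
    per_token = [
        vote_entropy([member[i] for member in committee_labels])
        for i in range(n_tokens)
    ]
    return sum(per_token) / n_tokens


def select_next_document(pool):
    """Return the id of the document the committee disagrees on most.

    pool maps a document id to that document's committee_labels.
    """
    return max(pool, key=lambda doc_id: document_disagreement(pool[doc_id]))


# Hypothetical toy pool: two documents, a committee of two recognizers.
pool = {
    "doc_a": [["O", "PER", "O"], ["O", "PER", "O"]],    # full agreement
    "doc_b": [["O", "PER", "O"], ["O", "ORG", "LOC"]],  # two tokens disputed
}
chosen = select_next_document(pool)
```

In this sketch the document with the highest average disagreement is handed to the annotator next; after annotation, the committee would be retrained and the remaining pool re-scored.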
