A combining approach to Find All taxon names (FAT) in legacy biosystematics literature

Most of the literature on natural history is hidden in millions of pages stacked up in our libraries. Various initiatives now aim to make these publications digitally accessible and searchable by applying XML markup technologies. Unique biological names play a crucial role in linking content related to a particular taxon, so discovering and marking them up is extremely important. Because manual extraction and markup is cumbersome and time-intensive, it needs to be automated. In this paper, we present computational linguistics techniques and evaluate how well they can extract taxonomic names automatically. We build on an existing approach for extracting such names (Koning et al. 2005) and combine it with several other learning techniques. We apply the techniques to the texts sequentially so that each one can use the results of the preceding ones. In particular, we use structural rules, dynamic lexica with fuzzy lookups, and word-level language recognition. We use legacy documents from different sources and periods as a test bed for our evaluation. The experimental results for our combining approach (FAT) show greater than 99% precision and recall. They reveal the potential of computational linguistics techniques for the automated markup of biosystematics publications.
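To make the idea of sequential combination concrete, the following is a minimal Python sketch, not the implementation evaluated in the paper. It chains three toy stages: a structural rule, a dynamic lexicon with fuzzy lookup, and a crude stand-in for word-level language recognition. The regular expression, the `COMMON_ENGLISH` stop list, the similarity threshold of 0.8, and all function names are assumptions made for illustration only.

```python
import re
from difflib import SequenceMatcher

# Assumed structural rule: a capitalized word followed by a lowercase word,
# the typical surface shape of a "Genus epithet" binomial.
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")

# Assumed stop list standing in for real word-level language recognition.
COMMON_ENGLISH = {"the", "bird", "was", "seen", "near", "river", "also", "found"}


def structural_candidates(text):
    """Stage 1: collect (genus, epithet) pairs that match the binomial shape."""
    return [(m.group(1), m.group(2)) for m in BINOMIAL.finditer(text)]


def fuzzy_in_lexicon(word, lexicon, threshold=0.8):
    """Stage 2 helper: True if the word is close to any known genus name."""
    return any(
        SequenceMatcher(None, word.lower(), known.lower()).ratio() >= threshold
        for known in lexicon
    )


def looks_english(genus, epithet):
    """Stage 3 helper: True if either word is on the common-English stop list."""
    return genus.lower() in COMMON_ENGLISH or epithet in COMMON_ENGLISH


def find_taxon_names(text, seed_genera):
    """Run the stages in sequence; each stage sees what the earlier ones produced."""
    lexicon = set(seed_genera)  # dynamic: grows while the text is processed
    found = []
    for genus, epithet in structural_candidates(text):
        if fuzzy_in_lexicon(genus, lexicon):
            # Known genus or a near miss (for example an OCR variant): accept.
            found.append(f"{genus} {epithet}")
            lexicon.add(genus)
        elif not looks_english(genus, epithet):
            # Unknown genus that does not look like ordinary English: accept
            # and remember it so that later occurrences match the lexicon.
            found.append(f"{genus} {epithet}")
            lexicon.add(genus)
    return found


text = "The bird Passer domesticus was seen near the river. Pasxer montanus was also found."
names = find_taxon_names(text, {"Passer"})
```

In this sketch, "The bird" passes the structural rule but is rejected by the stop list, while the misspelled "Pasxer" is still accepted because it is close enough to the seed entry "Passer". A realistic system would need far richer rules, lexica, and language models than these placeholders.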

[1] Herbert Lang, et al. The birds of the Belgian Congo. Part 1. Bulletin of the AMNH; v. 65, 1932.

[2] George A. Miller, et al. WordNet: A Lexical Database for the English Language, 2002.

[3] Kyoto University Library (京都大学附属図書館), et al. Systema naturae (自然の体系), 2006.

[4] Cheng Niu, et al. A Bootstrapping Approach to Named Entity Classification Using Successive Learners, 2003, ACL.

[5] Lynette Hirschman, et al. Mixed-Initiative Development of Language Processing Systems, 1997, ANLP.

[6] Hwee Tou Ng, et al. Named Entity Recognition: A Maximum Entropy Approach Using Global Information, 2002, COLING.

[7] Indra Neil Sarkar, et al. Taxongrab: Extracting Taxonomic Names from Text, 2005.

[8] Named Entity Recognition without Gazetteers Using a Machine Learning Approach. Paper ID: 85, 2002.

[9] Richard M. Schwartz, et al. Nymble: a High-Performance Learning Name-finder, 1997, ANLP.

[10] Herbert Friedmann, et al. The Birds of the Belgian Congo. Part 3. James P. Chapin, 1954.

[11] Hideki Isozaki, et al. Efficient Support Vector Classifiers for Named Entity Recognition, 2002, COLING.

[12] Herbert Friedmann, et al. The Birds of the Belgian Congo. Part 4. James P. Chapin, 1954.

[13] Ellen Riloff. Bootstrapping for text learning tasks, 1999.

[14] Xavier Carreras, et al. Introduction to the CoNLL-2005 Shared Task: Semantic Role Labeling, 2005, CoNLL.

[15] Lorraine K. Tanabe, et al. Tagging gene and protein names in biomedical text, 2002, Bioinform.

[16] J. Nunamaker, et al. Proceedings of the 32nd Hawaii International Conference on System Sciences, 1999.

[17] Toshihisa Takagi, et al. PNAD-CSS: a workbench for constructing a protein name abbreviation dictionary, 2000, Bioinform.

[18] Herbert Lang, et al. The birds of the Belgian Congo. Part 2 / by James P. Chapin. Bulletin of the AMNH; v. 75, 1939.

[19] David D. Palmer, et al. A Statistical Profile of the Named Entity Task, 1997, ANLP.

[20] Georgina M. Mace, et al. The role of taxonomy in species conservation, 2004, Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences.

[21] Erik F. Tjong Kim Sang, et al. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, 2003, CoNLL.

[22] M. F. Porter, et al. An algorithm for suffix stripping, 1997.

[23] Martin F. Porter, et al. An algorithm for suffix stripping, 1997, Program.

[24] Herbert Lang, et al. The parasitic worms collected by The American Museum of Natural History Expedition to the Belgian Congo, 1909-1914. Part 1, Trematoda. Bulletin of the AMNH; v. 58, article 6, 1929.

[25] David Yarowsky, et al. Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence, 1999, EMNLP.