Automatic Metadata Extraction From Museum Specimen Labels

This paper describes the information properties of museum specimen labels and the machine learning tools used to automatically extract Darwin Core (DwC) and other metadata from these labels after they have been processed through Optical Character Recognition (OCR). DwC is a metadata profile describing the core set of access points for search and retrieval of natural history collections and observation databases. Using the HERBIS Learning System (HLS), we extract 74 independent elements from these labels. The automated text extraction tools are provided as a web service, so that users can reference digital images of specimens and receive back an extended Darwin Core XML representation of the content of the label. This automated extraction task is made more difficult by the high variability of museum label formats, OCR errors, and the open-class nature of some elements. In this paper we introduce our overall system architecture and variability-robust solutions, including the application of Hidden Markov and Naive Bayes machine learning models, data cleaning, the use of field element identifiers, and specialist learning models. The techniques developed here could be adapted to any metadata extraction situation with noisy text and weakly ordered elements.
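
To make the Naive Bayes component concrete, the following is a minimal sketch of how a single OCR token might be assigned to a metadata element. It is not the HLS implementation: the training pairs, the feature set, and the function names (`features`, `train`, `classify`) are assumptions made for illustration, and the four element names are only a small sample of DwC-style terms rather than the 74 elements mentioned above. The features describe the shape of a token rather than its exact spelling, which is one plausible way to tolerate OCR errors.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data: (token, element) pairs. Not from the paper.
TRAIN = [
    ("Quercus", "genus"), ("Acer", "genus"), ("Carex", "genus"),
    ("alba", "specificEpithet"), ("rubrum", "specificEpithet"),
    ("rubra", "specificEpithet"),
    ("Smith", "recordedBy"), ("Jones", "recordedBy"),
    ("1932", "year"), ("1897", "year"), ("1954", "year"),
]


def features(token):
    """Shape features of a token (assumed feature set, chosen for illustration)."""
    return [
        "cap" if token[:1].isupper() else "nocap",
        "digits" if any(c.isdigit() for c in token) else "nodigits",
        "suffix=" + token[-2:].lower(),
        "len=" + str(min(len(token), 8)),
    ]


def train(pairs):
    """Count element labels and per-label feature occurrences."""
    label_counts = Counter()
    feat_counts = defaultdict(Counter)
    vocab = set()
    for token, label in pairs:
        label_counts[label] += 1
        for f in features(token):
            feat_counts[label][f] += 1
            vocab.add(f)
    return label_counts, feat_counts, vocab


def classify(token, model):
    """Return the label with the highest Naive Bayes log-probability."""
    label_counts, feat_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        # Add-one (Laplace) smoothing over the feature vocabulary.
        denom = sum(feat_counts[label].values()) + len(vocab)
        for f in features(token):
            score += math.log((feat_counts[label][f] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAIN)
print(classify("1B97", model))     # a year with one digit misread as a letter
print(classify("Querous", model))  # a genus name with one misread character
```

A classifier of this kind scores each token independently. A Hidden Markov Model would additionally use the order of elements on the label, which is the weak ordering that the abstract refers to.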
