The challenge of Virginia banks: an evaluation of named entity analysis in a 19th-century newspaper collection

This paper evaluates the automatic extraction of ten named entity classes from a 19th-century newspaper, the Civil War years of the Richmond Times Dispatch, digitized with IMLS support by the University of Richmond. It analyzes how well extraction succeeds for these ten categories, all prominent in the newspaper, and the particular problems that each class of named entity raises. Personal and place names are familiar, but other important categories (such as ship names and military units) illustrate the challenges that named entity identification confronts as it evolves into a fundamental tool not only for automatic metadata generation but also for searching and browsing. We conclude by suggesting the kinds of knowledge sources that digital libraries need to assemble as part of their machine-readable reference collections to support named entity identification as a core service.
