A metadata generation system for scanned scientific volumes

Digital libraries have conducted large-scale digitization projects to preserve cultural artifacts and to provide permanent access to them. The growing volume of digitized resources, including scanned books and scientific publications, calls for tools and methods that can efficiently analyze and manage large collections. In this work, we tackle the problem of extracting metadata from scanned volumes of journals. Our goal is to extract information describing the internal structure and content of scanned volumes, which is necessary for providing effective content access to digital library users. We propose methods for automatically generating volume-level, issue-level, and article-level metadata based on format and text features extracted from OCRed text. We evaluate the system on scanned bound historical documents nearly two centuries old. The system has been integrated into an operational digital library, the Internet Archive, for real-world use.
