Toward Enhanced Metadata Quality of Large-Scale Digital Libraries: Estimating Volume Time Range

Metadata is a special type of data that describes other data. In the age of Big Data, the role of metadata has become more prominent: as data sets grow exponentially, it becomes less and less feasible for humans to review all the data (even when it is human-readable), so big data needs high-quality metadata description. In this study we aim to enhance metadata records, specifically publication dates, by developing a temporal classification approach for a large-scale digital library. This approach can help assign temporal information to a library item, such as a book, given its full-text content. Temporal classification of text, whether webpage content, wiki pages, or volumes from Project Gutenberg, has attracted growing interest in information retrieval and computational linguistics. The addition of temporal information has been used to significantly improve query search results. Here we contribute methods that incorporate new, higher-order n-gram features, specifically bigrams and trigrams, to successfully predict a given document's membership in a chronon (a discrete time interval). We had the opportunity to work with the public domain corpora of the HathiTrust (HT) digital library, the world's largest digital library, which holds scanned volumes from research libraries covering a wide span of time, from pre-1500 to the present. The broad body of digital volumes in HT makes it possible to develop a temporal classification approach for this library as well as for similar large-scale digital libraries. In our data set, 13% of the metadata records are missing a publication date, which makes it a good corpus for applying temporal classification algorithms.
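As a rough illustration of the feature idea described above, the following minimal sketch counts word bigrams and trigrams in a text and maps a publication year to a fixed-width chronon. It is not the pipeline used in this study: the function names (word_ngrams, ngram_features, chronon_for_year), the whitespace tokenization, and the 50-year chronon width are all assumptions made for the example.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return the contiguous word n-grams (as tuples) found in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(text, orders=(2, 3)):
    """Count bigram and trigram occurrences in a text.

    Tokenization here is plain lowercasing plus whitespace splitting,
    which is an assumption; a real system would likely need OCR cleanup.
    """
    tokens = text.lower().split()
    counts = Counter()
    for n in orders:
        counts.update(word_ngrams(tokens, n))
    return counts


def chronon_for_year(year, width=50):
    """Map a publication year to the (start, end) years of its chronon.

    The 50-year default width is illustrative, not a value from the study.
    """
    start = (year // width) * width
    return (start, start + width - 1)
```

Counts of this kind, computed for volumes whose publication date is known, could serve as training features for a classifier whose labels are the chronons returned by chronon_for_year; volumes with missing dates would then be assigned to the most likely chronon.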