Metadata is data that describes other data. In the age of Big Data its role has become more prominent: as data sets grow exponentially, it becomes less and less feasible for humans to inspect all of the data (even where it is human-readable), so large collections need high-quality metadata descriptions. In this study we aim to enhance metadata records, specifically publication dates, by developing a temporal classification approach for a large-scale digital library. Given the full-text content of a library item, such as a book, the approach assigns temporal information to that item. Temporal classification of text, whether web page content, wiki pages, or volumes from Project Gutenberg, has attracted growing interest in information retrieval and computational linguistics, and adding temporal information has been used to significantly improve query search results. Here we contribute methods that incorporate new, higher-order n-gram features, specifically bigrams and trigrams, to successfully predict a given document's membership in a chronon, that is, a discrete interval of time. We work with the public domain corpora of the HathiTrust (HT) digital library, the world's largest digital library, which holds scanned volumes from research libraries covering a wide span of time, from pre-1500 to the present. The broad body of digital volumes in HT provides an opportunity to develop a temporal classification approach for this library and for similar large-scale digital libraries. In our data set, 13% of the metadata records lack a publication date, which makes it a good corpus for applying temporal classification algorithms.
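To make the idea of n-gram features and chronon membership concrete, the following is a minimal sketch and not the method used in this study. It assumes whitespace tokenization, fixed-width chronons of 50 years, and a multinomial naive Bayes classifier with add-one smoothing as a stand-in for whatever classifier the study actually trains. The names `ngrams`, `chronon_of`, and `NgramChrononClassifier` are hypothetical and introduced only for illustration.

```python
import math
from collections import Counter, defaultdict


def ngrams(tokens, n_max=3):
    """Return all n-grams of order 1..n_max as space-joined strings."""
    feats = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    return feats


def chronon_of(year, width=50):
    """Map a year to the start year of its fixed-width chronon (assumed binning)."""
    return (year // width) * width


class NgramChrononClassifier:
    """Illustrative multinomial naive Bayes over unigram-to-trigram counts."""

    def __init__(self, n_max=3, width=50, alpha=1.0):
        self.n_max = n_max
        self.width = width
        self.alpha = alpha                  # additive smoothing constant
        self.counts = defaultdict(Counter)  # chronon -> n-gram counts
        self.docs = Counter()               # chronon -> number of training documents
        self.vocab = set()

    def _features(self, text):
        # Assumption: lowercase and split on whitespace; real OCR text needs more care.
        return ngrams(text.lower().split(), self.n_max)

    def fit(self, texts, years):
        for text, year in zip(texts, years):
            chronon = chronon_of(year, self.width)
            feats = self._features(text)
            self.counts[chronon].update(feats)
            self.docs[chronon] += 1
            self.vocab.update(feats)
        return self

    def predict(self, text):
        # Ignore n-grams never seen in training; they carry no evidence here.
        feats = [f for f in self._features(text) if f in self.vocab]
        total_docs = sum(self.docs.values())
        vocab_size = len(self.vocab)
        best, best_score = None, -math.inf
        for chronon, counter in self.counts.items():
            denom = sum(counter.values()) + self.alpha * vocab_size
            score = math.log(self.docs[chronon] / total_docs)  # log prior
            for f in feats:
                score += math.log((counter[f] + self.alpha) / denom)
            if score > best_score:
                best, best_score = chronon, score
        return best
```

In this sketch a volume with a missing publication date would be passed to `predict`, and the returned chronon start year would serve as a candidate value for the metadata record. The chronon width and the smoothing constant are free parameters chosen here only for illustration.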