Preserving Imperfection: Assessing the Incidence of Digital Imaging Error in HathiTrust

Abstract: Large-scale digitization by third-party firms has drawn considerable controversy and criticism, especially in the case of Google Books. This article reports some of the findings and important implications of a rigorous multi-year quantitative and qualitative assessment of the images in a sizable proportion of the digital surrogates that Google created and deposited in HathiTrust, one of the most important large-scale preservation initiatives to emerge in higher education in the past fifty years. The study population consists of English-language books and serials published before 1923 that Google scanned and processed between 2004 and 2010. When the data for the study were gathered in 2011, this population comprised approximately 1.25 million volumes, or roughly 12 percent of the HathiTrust corpus. The findings suggest that imperfection in digital surrogates is an obvious and nearly ubiquitous feature of Google Books, and that such imperfection has become, and will remain, firmly ensconced in collaborative preservation repositories.
