Reliable access to massive restricted texts: Experience‐based evaluation

Libraries hold a growing number of digitized text corpora, many of which come with restrictions on access to their content. Large corpora, while of interest to scholars for computational analysis, can be cumbersome to work with because of the combination of size, granularity of access, and access restrictions. Efficient management of such a collection for general access, especially under failures, depends on the primary storage system. In this paper, we identify the requirements for managing a massive text corpus for computational analysis and use them as the basis for evaluating candidate storage solutions. The study is based on the 5.9 billion page collection of the HathiTrust digital library. Our findings led to the choice of Cassandra 3.x as the primary back-end store, which is currently in deployment at the HathiTrust Research Center.
