Petabytes in Practice: Working with Collections as Data at Scale

Abstract: The emerging transdiscipline of Computational Archival Science (CAS) brings together frameworks such as Brown Dog and repository software such as Digital Repository At Scale To Invite Computation (DRAS-TIC) to build an understanding of how to work with digital collections of cultural data at scale. Here, the DRAS-TIC and Brown Dog projects serve as the basis for an expandable distributed storage and service architecture with on-demand, horizontally scalable, integrated digital preservation and analysis services.
