Micro Archives as Rich Digital Object Representations

Digital objects and real-world entities are commonly referenced in the literature or on the Web by name, by a link to their website, or by a unique identifier such as a DOI or ORCID, which is backed by a set of metadata. All of these methods have severe disadvantages and are not always suitable: they are imprecise, not guaranteed to be persistent, or demand considerable additional effort from the author, who must collect the metadata needed to describe the reference accurately. Especially for complex, evolving entities and objects like software, pre-defined metadata schemas are often not expressive enough to capture their temporal state comprehensively. In previous work, we found that much meaningful information about software, such as a description, rich metadata, documentation, and source code, is usually available online. However, all of it needs to be preserved coherently to constitute a rich digital representation of the entity. We show that this is currently not the case: only 10% of the studied blog posts and roughly 30% of the analyzed software websites are archived completely, i.e., with all linked resources captured as well. Therefore, we propose Micro Archives as rich digital object representations that semantically and logically connect archived resources and ensure a coherent state. With Micrawler, we present a modular solution to create, cite, and analyze such Micro Archives. In this paper, we demonstrate the need for this approach and discuss opportunities and implications for various applications, including those beyond scholarly writing.
