Corpus Protocols: digital transformations of commercial newspaper collections for text and data mining to support academic research

This paper reports on outcomes from the Corpus Protocols project, which investigated the opportunities and challenges of text and data mining of commercial news content, including licensing, storage and accessibility of data, and communication between researchers and publishers. Related research in digital humanities and digital libraries has concentrated on methodologies, overviews and accessibility. This paper focuses on research in corpus linguistics based on digital news data, including the Declassified Documents Reference System and the Times Digital Archive. Modernising Copyright and the published draft regulations for UK legislation indicate new opportunities for libraries to support research using digital newspapers. Some publishers have pledged to bring more content online, and others are exploring sustainable commercial models for widening access to big data. This paper explores issues of research data management, innovations in web technologies for big data, and how research based on this kind of data can satisfy requirements for the security of commercial news data in the context of emerging legislation. We identify the potential conflicts that these requirements raise for research libraries and researchers.
