Methods and Approaches to Using Web Archives in Computational Communication Research

ABSTRACT This article examines the role of web archives as a critical source of data for computational communication research. Web archives are large-scale databases containing comprehensive records of websites that show how those websites have evolved over time. Recent communication scholarship using web archives is reviewed, demonstrating the breadth of research conducted in this space. A methodological framework is then proposed for using web archives in computational communication research. As a source of data, web archives present a number of methodological challenges, particularly with regard to their accuracy and completeness. These problems are addressed to better inform future work in this area. The closing sections outline a forward-looking trajectory for computational communication research using web archives.
