Hyphe, a Curation-Oriented Approach to Web Crawling for the Social Sciences

The web is a field of investigation for the social sciences, and platform-based studies have long proven their relevance. However, the generic web is rarely studied in its own right, even though it contains crucial aspects of the embodiment of social actors: personal blogs, institutional websites, hobby-specific media… We found that some sociologists, although willing to study the broad web, see existing web crawlers as “black boxes” unsuitable for research. In this paper we present Hyphe, a crawler developed with and for social scientists that takes an innovative “curation-oriented” approach. We describe the problems of using web-mining techniques in social science research and show how to overcome them with specific features, such as step-by-step corpus building and a memory structure that lets researchers dynamically redefine the granularity of their “web entities”.
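To make the idea of redefinable “web entities” more concrete, here is a minimal sketch in Python. It is not Hyphe's actual implementation: it assumes, purely for illustration, that a web entity is declared by a URL prefix and that a page belongs to the entity with the longest matching prefix. The names `url_to_stems`, `WebEntityIndex`, `define`, and `entity_for` are hypothetical.

```python
from urllib.parse import urlsplit


def url_to_stems(url):
    """Split a URL into ordered stems, from most generic to most specific."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Reverse the host so "blog.example.org" becomes ["org", "example", "blog"].
    stems = [s for s in reversed(host.split(".")) if s]
    # Append the non-empty path segments after the host stems.
    stems += [s for s in parts.path.split("/") if s]
    return tuple(stems)


class WebEntityIndex:
    """Hypothetical index mapping URL prefixes to web entity names."""

    def __init__(self):
        self._prefixes = {}  # tuple of stems -> entity name

    def define(self, name, prefix_url):
        """Declare (or redeclare) a web entity by a URL prefix."""
        self._prefixes[url_to_stems(prefix_url)] = name

    def entity_for(self, url):
        """Return the entity with the longest matching prefix, or None."""
        stems = url_to_stems(url)
        for length in range(len(stems), 0, -1):
            name = self._prefixes.get(stems[:length])
            if name is not None:
                return name
        return None


index = WebEntityIndex()
index.define("Example", "http://example.org")
page = "http://example.org/blogs/alice/post-1"
before = index.entity_for(page)

# Refine the granularity: one blog becomes its own web entity.
index.define("Alice's blog", "http://example.org/blogs/alice")
after = index.entity_for(page)
```

Under these assumptions, the same page is first attributed to the whole site and then, once a finer prefix is declared, to the blog alone. No page data has to be re-crawled for this; only the prefix table changes.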
