To increase trust, change the social design behind aggregated biodiversity data

Abstract: Growing concerns about the quality of aggregated biodiversity data are lowering trust in large-scale data networks. Aggregators frequently respond to quality concerns by recommending that biologists work with original data providers to correct errors ‘at the source.’ We show that this strategy falls systematically short of a full diagnosis of the underlying causes of distrust. In particular, trust in an aggregator is not just a feature of the data signal quality provided by the sources to the aggregator, but also a consequence of the social design of the aggregation process and the resulting power balance between individual data contributors and aggregators. The latter have created an accountability gap by downplaying the authorship and significance of the taxonomic hierarchies (frequently called ‘backbones’) that they generate, which are in effect novel classification theories operating at the core of the data-structuring process. The Darwin Core standard for sharing occurrence records plays an under-appreciated role in maintaining the accountability gap: it lacks the syntactic structure needed to preserve the taxonomic coherence of data packages submitted for aggregation, which can lead to inferences that no individual source would support. Since high-quality data packages can mirror competing and conflicting classifications, i.e. unsettled systematic research, this plurality must be accommodated in the design of biodiversity data integration. Looking forward, a key directive is to develop new technical pathways and social incentives for experts to contribute directly to the validation of taxonomically coherent data packages as part of a greater, trustworthy aggregation process.
