Web scraping technologies in an API world

Web services are the de facto standard in biomedical data integration. However, there are data integration scenarios that cannot be fully covered by Web services. A number of Web databases and tools do not support Web services, and existing Web services do not cover all possible user data demands. As a consequence, Web data scraping, one of the oldest techniques for extracting Web contents, is still in a position to offer a valid and valuable service to a wide range of bioinformatics applications, ranging from simple extraction robots to online meta-servers. This article reviews existing scraping frameworks and tools, identifying their strengths and limitations in terms of extraction capabilities. The main focus is on showing how straightforward it is today to set up a data scraping pipeline with minimal programming effort and to answer a number of practical needs. As an example, we introduce a biomedical data extraction scenario where the desired data sources, well known in clinical microbiology and similar domains, do not yet offer programmatic interfaces. Moreover, we describe the operation of WhichGenes and PathJam, two bioinformatics meta-servers that use scraping as a means to cope with gene set enrichment analysis.
