‘Sciencenet’—towards a global search and share engine for all scientific knowledge

Summary: Modern biological experiments create vast amounts of data which are geographically distributed. These datasets consist of petabytes of raw data and billions of documents. Yet to the best of our knowledge, a search engine technology that searches and cross-links all different data types in life sciences does not exist. We have developed a prototype distributed scientific search engine technology, ‘Sciencenet’, which facilitates rapid searching over this large data space. By ‘bringing the search engine to the data’, we do not require server farms. This platform also allows users to contribute to the search index and publish their large-scale data to support e-Science. Furthermore, a community-driven method guarantees that only scientific content is crawled and presented. Our peer-to-peer approach is sufficiently scalable for the science web without performance or capacity tradeoff.

Availability and Implementation: The free to use search portal web page and the downloadable client are accessible at: http://sciencenet.kit.edu. The web portal for index administration is implemented in ASP.NET, the ‘AskMe’ experiment publisher is written in Python 2.7, and the backend ‘YaCy’ search engine is based on Java 1.6.

Contact: urban.liebel@kit.edu

Supplementary Material: Detailed instructions and descriptions can be found on the project homepage: http://sciencenet.kit.edu.
